How do you match a numeric value to a categorical value in another data set - r

I have two data sets. One with a numeric value assigned to individual categorical variables (country name) and a second with survey responses including a person's nationality. How do I assign the numeric value to a new column in the survey dataset with matching nationality/country name?
Here is the head of data set 1 (my.data1):
EN HCI
1 South Korea 0.845
2 UK 0.781
3 USA 0.762
Here is the head of data set 2 (my.data2):
Nationality OIS IR
1 South Korea 2 2
2 South Korea 3 3
3 USA 3 4
4 UK 3 3
I would like to make it look like this:
Nationality OIS IR HCI
1 South Korea 2 2 0.845
2 South Korea 3 3 0.845
3 USA 3 4 0.762
4 UK 3 3 0.781
I have tried this but unsuccessfully:
my.data2$HCI <- NA
for (i in i:nrow(my.data2)) {
my.data2$HCI[i] <- my.data1$HCI[my.data1$EN == my.data2$Nationality[i]]
}

We can use a left_join
library(dplyr)
left_join(my.data2, my.data1, by = c("Nationality" = "EN"))
Or with merge from base R
merge(my.data2, my.data1, by.x = "Nationality", by.y = "EN", all.x = TRUE)
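For a single lookup column, base R's match() is another option, and it keeps the rows of my.data2 in their original order. A minimal sketch, assuming the two data frames look like the heads shown in the question (they are rebuilt by hand here):

```r
# Rebuild the example data from the question (assumed to match the real data's layout)
my.data1 <- data.frame(EN = c("South Korea", "UK", "USA"),
                       HCI = c(0.845, 0.781, 0.762))
my.data2 <- data.frame(Nationality = c("South Korea", "South Korea", "USA", "UK"),
                       OIS = c(2, 3, 3, 3),
                       IR = c(2, 3, 4, 3))

# match() returns, for each Nationality, the row position of that country in my.data1
idx <- match(my.data2$Nationality, my.data1$EN)

# Index HCI with those positions; nationalities missing from my.data1 become NA
my.data2$HCI <- my.data1$HCI[idx]
```

Nationalities with no entry in my.data1 get NA rather than raising an error, which is the same behaviour as the left join.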

Related

How to Merge Shapefile and Dataset?

I want to create a spatial map showing drug mortality rates by US county, but I'm having trouble merging the drug mortality dataset, crude_rate, with the shapefile, usa_county_df. Can anyone help out?
I've created a key variable, "County", in both sets to merge on but I don't know how to format them to make the data mergeable. How can I make the County variables correspond? Thank you!
head(crude_rate, 5)
Notes County County.Code Deaths Population Crude.Rate
1 Autauga County, AL 1001 74 975679 7.6
2 Baldwin County, AL 1003 440 3316841 13.3
3 Barbour County, AL 1005 16 524875 Unreliable
4 Bibb County, AL 1007 50 420148 11.9
5 Blount County, AL 1009 148 1055789 14.0
head(usa_county_df, 5)
long lat order hole piece id group County
1 -97.01952 42.00410 1 FALSE 1 0 0.1 1
2 -97.01952 42.00493 2 FALSE 1 0 0.1 2
3 -97.01953 42.00750 3 FALSE 1 0 0.1 3
4 -97.01953 42.00975 4 FALSE 1 0 0.1 4
5 -97.01953 42.00978 5 FALSE 1 0 0.1 5
crude_rate$County <- as.factor(crude_rate$County)
usa_county_df$County <- as.factor(usa_county_df$County)
merge(usa_county_df, crude_rate, "County")
[1] County long lat order hole
[6] piece id group Notes County.Code
[11] Deaths Population Crude.Rate
<0 rows> (or 0-length row.names)
My take on this. First, you cannot expect a full answer with code because you did not provide a link to your data. Next time, please provide a full description of the problem along with the data.
I just used the data you provided here to illustrate.
library(tidyverse)
# Load the data
crude_rate = read.csv("county_crude.csv", header = TRUE)
usa_county = read.csv("usa_county.csv", header = TRUE)
# Create the variable "county_join" within the county_crude to "left_join" on with the usa_county data. Note that you have to have the same type of data variable between the two tables and the same values as well
crude_rate = crude_rate %>%
mutate(county_join = c(1:5))
# Join the dataframes using a left join on the county_join and County variables
df_all = usa_county %>%
left_join(crude_rate, by = c("County"="county_join")) %>%
distinct(order,hole,piece,id,group, .keep_all = TRUE)
Data link: county_crude
Data link: usa_county
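A zero-row merge usually means the two key columns share no values. One quick check is to count the overlap before merging. This is sketched on made-up stand-ins for the two tables: shape_df and rates are hypothetical names, and the values are illustrative only.

```r
# Hypothetical stand-ins: the shapefile side has numeric-looking keys,
# the mortality side has "Name County, ST" keys
shape_df <- data.frame(long = c(-97.0, -97.1, -86.5),
                       County = c("1", "2", "3"))
rates <- data.frame(County = c("Autauga County, AL", "Baldwin County, AL"),
                    Crude.Rate = c(7.6, 13.3))

# Key values present in both tables; an empty result explains a 0-row merge
shared <- intersect(shape_df$County, rates$County)
length(shared)

# With no shared keys the inner join has nothing to return
merged <- merge(shape_df, rates, by = "County")
nrow(merged)
```

If the overlap is empty, the keys need to be rebuilt from an identifier that exists in both sources before any join can work.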

How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function?

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo NA 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India NA 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa NA 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa NA 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa NA 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States NA 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
And after filling the grade column I have something like this (fish19)
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo D 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India A 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa C 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa A 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa E 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States B 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
Both dataframes have many specimens belonging to the same species of fish, and the grades are supposed to be assigned per species, so every specimen of a species gets the same grade. The problem I'm having is that some rows belonging to the same species end up with different grades, especially in the case of grades "C" and "E". What I want to incorporate into my ifelse function is: whenever specimens of the same species are assigned "C" in one row and "E" in another, change the "C" to "E". If one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% function and just using "=="
Trying with %in%
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]%in%fish19$species[fish19$grade=="C"]==TRUE,"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]==fish19$species[fish19$grade=="C"],"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Neither of these options worked. The desired output is that if one occurrence of a species name has the grade "E" assigned to it, so do all other occurrences with that same species name.
I'm sorry if this was confusing; I tried to be as clear as I could. Thank you in advance for any responses.
Kind of a long-winded answer, but:
library(dplyr)
dat = data.frame('species'=c('a','b','c','a','a','b'),'grade'=c('E','E','C','C','C','D'))
# count the "E" grades per species and join that count back onto every row
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade=='E')),by='species')
Then you could use ifelse on sum_e > 0 to set the grade to "E".
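The final step can also be done without the join. This is a base R sketch on the same toy data, assuming (as the question states) that only "C" rows should be upgraded to "E". The name species_with_e is introduced here for illustration.

```r
# Toy data from the answer above
dat <- data.frame(species = c("a", "b", "c", "a", "a", "b"),
                  grade = c("E", "E", "C", "C", "C", "D"),
                  stringsAsFactors = FALSE)

# Species that have at least one row graded "E"
species_with_e <- unique(dat$species[dat$grade == "E"])

# Upgrade "C" to "E" only for rows whose species appears in that set
dat$grade <- ifelse(dat$grade == "C" & dat$species %in% species_with_e,
                    "E", dat$grade)
```

Dropping the dat$grade == "C" condition would instead force every row of an "E" species to "E", whichever grade it had before.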

How do I add another column to a dataframe in R that shows the difference between the columns of two other dataframes?

What I have:
I have two dataframes to work with. Those are:
> print(myDF_2003)
A_score country B_score
1 200 Germany 11
2 150 Italy 9
3 0 Sweden 0
and:
> print(myDF_2005)
A_score country B_score
1 -300 France 16
2 100 Germany 12
3 200 Italy 15
4 40 Spain 17
They are produced by the following code, which I do not want to change:
#_________2003______________
myDF_2003=data.frame(c(200,150,0),c("Germany", "Italy", "Sweden"), c(11,9,0))
colnames(myDF_2003)=c("A_score","country", "B_score")
myDF_2003$country=as.character(myDF_2003$country)
myDF_2003$country=factor(myDF_2003$country, levels=unique(myDF_2003$country))
myDF_2003$A_score=as.numeric(as.character(myDF_2003$A_score))
myDF_2003$B_score=as.numeric(as.character(myDF_2003$B_score))
#_________2005______________
myDF_2005=data.frame(c(-300,100,200,40),c("France","Germany", "Italy", "Spain"), c(16,12,15,17))
colnames(myDF_2005)=c("A_score","country", "B_score")
myDF_2005$country=as.character(myDF_2005$country)
myDF_2005$country=factor(myDF_2005$country, levels=unique(myDF_2005$country))
myDF_2005$A_score=as.numeric(as.character(myDF_2005$A_score))
myDF_2005$B_score=as.numeric(as.character(myDF_2005$B_score))
What I want:
I want to add another column to myDF_2005 containing the difference between the B_score values for countries that exist in both of the previous dataframes. In other words, I want to produce this output:
> print(myDF_2005_2003_Diff)
A_score country B_score B_score_Diff
1 -300 France 16
2 100 Germany 12 1
3 200 Italy 15 6
4 40 Spain 17
Question:
What is the most elegant code to do this?
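# join in a temporary dataframe
# (merge() sorts the result by country, so this relies on myDF_2005 already being in that order)
temp <- merge(myDF_2005, myDF_2003, by = "country", all.x = TRUE)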
# calculate the difference and assign a new column
myDF_2005$B_score_Diff <- temp$B_score.x - temp$B_score.y
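If the row order of myDF_2005 cannot be relied on, a match()-based lookup avoids the temporary data frame altogether. A minimal sketch, with the two data frames rebuilt in simplified form (plain character country columns rather than the factors produced by the question's code):

```r
# Simplified copies of the question's data frames
myDF_2003 <- data.frame(A_score = c(200, 150, 0),
                        country = c("Germany", "Italy", "Sweden"),
                        B_score = c(11, 9, 0))
myDF_2005 <- data.frame(A_score = c(-300, 100, 200, 40),
                        country = c("France", "Germany", "Italy", "Spain"),
                        B_score = c(16, 12, 15, 17))

# Position of each 2005 country in the 2003 table (NA when absent);
# as.character() keeps this safe if the columns are factors
idx <- match(as.character(myDF_2005$country), as.character(myDF_2003$country))

# Subtract the matched 2003 score; unmatched countries yield NA
myDF_2005$B_score_Diff <- myDF_2005$B_score - myDF_2003$B_score[idx]
```

Because the lookup is done row by row against myDF_2005, the new column lines up with the existing rows whatever their order.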
A solution using dplyr. The idea is to merge the two data frame and then calculate the difference.
library(dplyr)
myDF_2005_2 <- myDF_2005 %>%
left_join(myDF_2003 %>% select(-A_score), by = "country") %>%
mutate(B_score_Diff = B_score.x - B_score.y) %>%
select(-B_score.y) %>%
rename(B_score = B_score.x)
myDF_2005_2
# A_score country B_score B_score_Diff
# 1 -300 France 16 NA
# 2 100 Germany 12 1
# 3 200 Italy 15 6
# 4 40 Spain 17 NA

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
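For comparison, aggregate() also has a formula interface that names the summed columns and the grouping columns in one expression. A small sketch on the table from the question, rebuilt by hand with the column names used in the asker's code:

```r
# The example table from the question
GT2 <- data.frame(
  iyear = c(1970, 1970, 1971, 1971, 1971, 1971, 1972, 1972),
  country_txt = c("UK", "USA", "UK", "UK", "UK", "USA", "USA", "USA"),
  V1 = c(1, 1, 2, 2, 1, 2, 1, 2),
  V2 = c(3, 3, 5, 3, 5, 2, 1, 5)
)

# Left of ~ : columns to sum; right of ~ : grouping columns
agg <- aggregate(cbind(V1, V2) ~ country_txt + iyear, data = GT2, FUN = sum)

# Pull out one group to check the result
uk71 <- agg[agg$country_txt == "UK" & agg$iyear == 1971, ]
```

The formula method drops rows with NA in any of the named columns by default, so a separate na.omit() step is not needed for those columns.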

rworldmap package - Warning if the number of quantiles was reduced

I am using this R code:
library(rworldmap)
Data <- read.table("D:/Bla/Maps/Test.txt", header = TRUE, sep = "\t")
sPDF <- joinCountryData2Map(Data, joinCode = "ISO3",nameJoinColumn = "ISO3CountryCode")
mapCountryData(sPDF, nameColumnToPlot = "Data")
This produces a map but I get:
You asked for 7 quantiles, only 1 could be created in quantiles classification
I googled and it pointed me to this code
Not sure whether it is relevant.
This is the data I have used:
ISO3CountryCode Data
JPN 7
AUS 6
IND 6
CHN 5
GBR 5
CHE 4
IRN 4
DEU 3
EGY 3
ESP 3
LBY 3
TUN 3
USA 3
ARG 2
AUT 2
BRA 2
EST 2
GRC 2
ITA 2
TUR 2
URY 2
CHL 1
ETH 1
FRA 1
JOR 1
KEN 1
KOR 1
LTU 1
MEX 1
NLD 1
NZL 1
PER 1
POL 1
SAU 1
SRB 1
SVK 1
SVN 1
TZA 1
ZAF 1
It looks like by default mapCountryData() tries to fit data to quantiles for binning. You'll need to help it along a little by tweaking the catMethod parameter.
I'm not sure what your values 1 through 7 mean. If they are categories (and you want them all explicitly displayed in the legend), try:
mapCountryData(sPDF, nameColumnToPlot = "Data", catMethod="categorical")
If you want to treat all values equally on a continuous scale, try:
mapCountryData(sPDF, nameColumnToPlot = "Data", catMethod="fixedWidth")
If neither of these does what you want, you might try altering numCats and/or catMethod; see ?mapCountryData for the possible values and their meaning.
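One plausible reason quantile binning struggles here is that the data are heavily tied, so several quantile breaks land on the same value. The sketch below uses base R's quantile() on the values from the question. It illustrates the general effect and is not rworldmap's internal classification code, which may compute its breaks differently.

```r
# The Data column from the question: one 7, two each of 6, 5 and 4,
# six 3s, eight 2s and eighteen 1s
vals <- c(7, 6, 6, 5, 5, 4, 4, rep(3, 6), rep(2, 8), rep(1, 18))

# Seven quantile classes need eight break points
brks <- quantile(vals, probs = seq(0, 1, length.out = 8))

# Tied values make several breaks coincide, leaving fewer usable classes
n_unique <- length(unique(brks))
n_unique < length(brks)
```

When breaks coincide like this, a categorical or fixed-width scheme sidesteps the problem because neither depends on how the values are distributed.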
