Fuzzy matching countries in R

For an assignment I have to use fuzzy matching in R to merge two datasets that both have a "Country" column. The first dataset is from Kaggle (the Countries dataset) and the other is from the ISO 3166 standard. I have already run the fuzzy matching and it worked well. To both datasets I added a new column, named "Observation number", that numbers the observations from 1 up to the length of the dataset (as far as I understand, this is required for fuzzy matching). My first dataset has 227 observations and the ISO dataset has 249.
I want to create a new dataset that includes the columns from my first dataset (I have to use this dataset specifically, since it has columns like migration, literacy, etc.) and the country codes from the ISO dataset, but I couldn't manage to do it. The fuzzy matching output tells me where each observation of the first dataset sits in the ISO dataset. For example, if the first dataset is ordered Afghanistan, Albania, Algeria, ... while ISO is ordered Albania, Algeria, Afghanistan, the fuzzy match output is 3, 1, 2, ... I understand this to mean that the 3rd observation in the ISO dataset is the 1st in the Countries dataset.
I want to create a new dataset that has all the information from the Countries dataset, ordered with respect to the order of the ISO dataset's Country column.
However, I cannot do it using
a <- Result1$matches$observationnumber
# gives me vector a: a[i] is where the i-th observation of the Country dataset can be found in the ISO dataset
countryorderedlikeISO <- countries.of.the.world[match(a, countries.of.the.world$observation), ]
It seems to ignore the countries that are present in ISO but not in the country dataset.
What can I do? I want this new dataset to be in ISO's length, with NA values for observations that are present in ISO but not in Country.
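One possible approach, sketched here with toy data: build a position vector with one entry per ISO row (NA where the Countries dataset has no counterpart) and use it to index the Countries rows. Indexing a data frame with an NA row index yields a row of NAs, so the result keeps ISO's length. The data frames `countries` and `iso`, their values, and the exact-match stand-in for the fuzzy output are all assumptions made for illustration.

```r
# Toy stand-ins for the two datasets (hypothetical values, not the real data).
countries <- data.frame(
  Country  = c("Afghanistan", "Albania", "Algeria"),
  Literacy = c(36.0, 86.5, 70.0),
  stringsAsFactors = FALSE
)
iso <- data.frame(
  Country = c("Albania", "Algeria", "Andorra", "Afghanistan"),
  Code    = c("ALB", "DZA", "AND", "AFG"),
  stringsAsFactors = FALSE
)

# Stand-in for the fuzzy matching output: a[i] is the ISO position of the
# i-th Countries row. Exact matching is used here only to keep the toy simple.
a <- match(countries$Country, iso$Country)

# Invert the mapping: for every ISO position, which Countries row points to it?
# ISO rows that no Countries row points to get NA.
idx <- match(seq_len(nrow(iso)), a)

# Indexing rows with NA produces a row of NAs, so the result has ISO's length.
ordered_like_iso <- cbind(iso, countries[idx, "Literacy", drop = FALSE])
rownames(ordered_like_iso) <- NULL
```

Here "Andorra" exists only in the toy ISO table, so its `Literacy` entry comes out as NA while its ISO code is kept.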

Related

replacing missing values in R with the one value that follows (not the mean)

I'm trying to replace the missing values in R with the value that follows. I have annual data for income by country, and for a missing income value, say 2001 for country A, I want to pull in the next value (this is for time series analysis with multiple countries and different columns for different variables; income is just one of them).
I wrote this code for replacing the missing values with the mean, but statistically I think it makes more sense to replace each missing value with the value right below it (the one for the next year), since the numbers differ a lot between countries and the average would be taken over all years for all countries.
Social_data_R<-within(Social_data_R,incomeNAavg[is.na(income)]<-mean(income,na.rm=TRUE))
I tried replacing the mean part of the code above with income[i+1], but it didn't recognize 'i' (I imported the data from Excel, so I didn't create the data frame manually).
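A minimal base R sketch of "fill with the next value", assuming the rows can be sorted by year within each country. The helper `fill_next`, the column names `country` and `year`, the new column `incomeNAnext`, and the toy values are assumptions for illustration; only `Social_data_R` and `income` come from the question.

```r
# Replace each NA with the next non-missing value in the vector.
# A trailing NA has no later value to borrow from and stays NA.
fill_next <- function(x) {
  for (i in rev(seq_along(x))[-1]) {   # walk from second-to-last element back to the first
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# Toy data (hypothetical values).
Social_data_R <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(2000, 2001, 2002, 2000, 2001, 2002),
  income  = c(10, NA, 12, NA, 50, NA)
)

# Sort so that "next row" means "next year of the same country".
Social_data_R <- Social_data_R[order(Social_data_R$country, Social_data_R$year), ]

# ave() applies fill_next separately within each country,
# so values are never borrowed across countries.
Social_data_R$incomeNAnext <- ave(Social_data_R$income, Social_data_R$country, FUN = fill_next)
```

If adding packages is acceptable, `tidyr::fill()` with `.direction = "up"` or `zoo::na.locf()` with `fromLast = TRUE` cover the same idea.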

In R - How to sum up the values in a column for equal names in another column?

I have an R dataset with different IATA Codes. Unfortunately, there are some equal IATA codes with different destination values (e.g. MCO).
How can I sum up the departure values per IATA Code, to finally have every IATA code only once?
I already tried to loop over it but don't know if that's the right approach ...
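A loop is not needed for this; grouped summation is built into base R. The sketch below uses `aggregate()` on a toy data frame. The names `flights`, `iata` and `departures` and the values are assumptions, since the question does not show the real column names.

```r
# Toy data (hypothetical column names and values).
flights <- data.frame(
  iata       = c("MCO", "JFK", "MCO", "LAX", "JFK"),
  departures = c(10, 5, 7, 3, 2)
)

# Sum departures within each IATA code; the result has one row per code.
totals <- aggregate(departures ~ iata, data = flights, FUN = sum)
```

In `totals` every IATA code appears once, with its departures added up (for example, the two MCO rows are collapsed into a single row).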

How to extract the unique values of a dataframe in comparison to others from an Upset plot? [R]

I have several data files and I showed their intersections with an upset plot. I now want to know which values are unique to each dataset. For example, as in this picture, how can I extract the names/values of the 232 elements found only in the Thriller category?
I first used union to combine all my data into a single data frame and then used setdiff, as in setdiff(data1, all), to find the unique values, but it returned nothing, while my real upset plot shows 10 values unique to data1.
Thanks.
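A likely reason `setdiff(data1, all)` comes back empty is that `all` was built as the union of every dataset, including `data1` itself, so every value of `data1` is also in `all`. A sketch of the alternative, assuming each dataset can be reduced to a vector of names: compare `data1` against the union of the other sets only. The list `sets`, its contents, and the helper `unique_to` are hypothetical.

```r
# Toy sets (hypothetical values).
sets <- list(
  data1 = c("a", "b", "c", "d"),
  data2 = c("b", "c", "e"),
  data3 = c("c", "f")
)

# Values that occur in one named set and in none of the others.
unique_to <- function(sets, name) {
  others <- unlist(sets[names(sets) != name], use.names = FALSE)
  setdiff(sets[[name]], others)
}

only_in_data1 <- unique_to(sets, "data1")

# The same computation for every set at once, returned as a named list.
only_in_each <- lapply(setNames(names(sets), names(sets)),
                       function(n) unique_to(sets, n))
```

The lengths of the elements of `only_in_each` should then correspond to the single-set bars of the upset plot.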

How to merge two tables of different row numbers with approximate common values? (using R)

I am struggling to understand how to combine in R two tables when the common variables are not exactly similar.
To give some context, I have downloaded two sources of information about politicians, from Twitter and from the administration, and created two different data frames. In the first data frame (dataset 1), I have the names of the politicians present on Twitter. However, I don't know whether these politicians are currently in office or not. To find that out, I could use the second data frame.
The second data frame (dataset 2) contains the name and other information about the politicians now in function.
The first and last names are the only variables contained in both tables. The two tables do not have the same number of rows.
Problem:
The names in the first dataset are stored as one variable (first name + last name), whereas in the second dataset they are separated into two variables (last name and first name). I used separate to split the name column in the first table: parliament_twitter_tempdata <- separate(parliament_twitter_tempdata, col = name, into = c("firstname", "lastname"), extra = "merge")
However I have problems with it as both datasets have:
composed first names and composed last names
first name and last name in the incorrect order
I have included a picture of a part (from lastname "J" to "M") of both datasets to illustrate the difference between the similar values or the inversion of lastname, firstname.
How could I improve my code?
The names in both tables are not completely identical. Some people did not write their official name on Instagram. Is there any function that could compare the two tables, find the names that correspond at around 80%, and replace the name in data frame 1 (from Twitter) with the official name from data frame 2? E.g. Dataset 1: Marie Gabour; Dataset 2: Marie Gabour Jolliet -> replace the Marie Gabour from dataset 1 with Marie Gabour Jolliet.
Could someone help me there? Many thanks !
[Image 1: part of dataset 1 after separating the names (lastname from "J" to "M")]
[Image 2: part of the names in dataset 2 (lastname from "J" to "M")]
Fuzzy matching might be a way to move forward:
https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf
Also, cleaning functions may help (e.g., using toupper or removing whitespace on the key).
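The cleaning and fuzzy matching ideas above can be sketched in base R, without committing to a particular package. The sketch normalises each full name (upper case, collapsed whitespace, name parts sorted so that "last first" and "first last" compare equal) and then scores candidates with the edit distance from `adist()`. The vectors `twitter` and `official`, the helpers `normalise` and `similarity`, and the 0.8 cut-off are assumptions for illustration.

```r
# Toy names (hypothetical, loosely based on the example in the question).
twitter  <- c("Marie Gabour", "Jolliet  Paul", "jean-luc MARTIN")
official <- c("Marie Gabour Jolliet", "Paul Jolliet", "Jean-Luc Martin", "Anna Keller")

# Normalise a name: collapse whitespace, upper-case, and sort the name parts
# so that the order of first and last name no longer matters.
normalise <- function(x) {
  x <- toupper(trimws(gsub("\\s+", " ", x)))
  vapply(strsplit(x, " ", fixed = TRUE),
         function(tokens) paste(sort(tokens), collapse = " "),
         character(1))
}

# Similarity in [0, 1]: 1 minus the edit distance scaled by the longer string.
similarity <- function(a, b) {
  d <- adist(a, b)                          # rows follow a, columns follow b
  1 - d / outer(nchar(a), nchar(b), pmax)
}

s        <- similarity(normalise(twitter), normalise(official))
best     <- apply(s, 1, which.max)          # best official candidate per Twitter name
best_sim <- s[cbind(seq_along(best), best)]

# Keep the official name only when the best candidate is similar enough.
matched <- ifelse(best_sim >= 0.8, official[best], NA_character_)
```

With these toy names, the reordered and differently cased entries match exactly after normalisation, while "Marie Gabour" scores only 0.6 against "Marie Gabour Jolliet" because a whole name part is missing, so an 80% cut-off leaves it unmatched. A lower threshold or a token-based comparison may suit that case better; the fuzzyjoin package linked above offers joins built on the same idea.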

Problems creating table from data in long format

I'll describe my data:
First column are corine_values, going from 1 to 50.
Second column are bird_names, there are 70 different bird_names, each corine_value has several bird_names.
Third column contains the sex of the bird_name.
Fourth column contains a V1-value (measurement) that belongs to the category described by the first three columns.
I want to create a table where the row names are the bird_names: first all the females in alphabetical order, followed by the males in alphabetical order. The column names should be the corine_values, from small to big. The data in the table should be the corresponding V1-values.
I've been trying some things, but to be honest I'm just starting with R and I don't really have a clue how to do it. I can sort the data, but not on multiple levels (like alphabetical and sex combined). I'm exporting everything to Excel now and doing it manually, which is very time-consuming.
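A sketch of how the long-to-wide step could look in base R, assuming at most one V1 value per combination of corine_value, bird_name and sex. The data frame `birds`, its values, and the sex labels "female"/"male" are assumptions; because row names must be unique, the sketch prefixes each bird name with its sex, which also makes females sort before males.

```r
# Toy long-format data (hypothetical values).
birds <- data.frame(
  corine_value = c(1, 1, 2, 2, 1, 2),
  bird_name    = c("robin", "wren", "robin", "wren", "robin", "robin"),
  sex          = c("male", "female", "male", "female", "female", "female"),
  V1           = c(0.5, 1.2, 0.7, 1.1, 0.9, 0.4)
)

# Row key: sex first, then bird name, so alphabetical order of the key puts
# all females (alphabetically) before all males (alphabetically).
row_key <- paste(birds$sex, birds$bird_name)

# tapply() builds the rows-by-columns table; factor levels are sorted, so the
# numeric corine_values run from small to big. Combinations that do not occur
# are NA. mean() simply returns the single value under the assumption above.
wide <- tapply(birds$V1, list(bird = row_key, corine = birds$corine_value), FUN = mean)
```

The result is a matrix that can be written out with `write.csv(wide, "birds_wide.csv")` (file name hypothetical) if the table is still needed in Excel.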
