Divide one column into several variables in R

I have one column with multiple sets of variables as listed below. I have to divide this column into several variables. For example, this column has information about country, tax number, name, address, date, etc. I have to create multiple variables - Country : mx, Tax Number : DID100927249, etc.- from this column but I am not sure how to do it. Does anyone know how I can do this?
{'country': ['mx'], 'taxNumber': ['DID100927249'], 'sourceUrl': ['https://sanctionssearch.ofac.treas.gov/Details.aspx?id=13009'], 'name': ['DISPOSITIVOS INDUSTRIALES DINAMICOS, S.A. DE C.V.'], 'alias': ['DISDA'], 'topics': ['sanction'], 'addressEntity': ['addr-f1636ec5b03213730a35200572c1273749727da4'], 'createdAt': ['2012-04-12']}
I tried the following code but in each column the order of properties is different so it is not working well :(
separate(df, col=properties, into=c('a', 'b','c','d','e'), sep=',')
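Because the order of the properties changes from row to row, one option is to pull out each 'key': ['value'] pair by name instead of by position. The following is only a minimal base-R sketch: the sample data frame, the shortened strings and the helper parse_properties are made up for illustration, and it assumes that every list holds exactly one value and that values contain no single quotes.

```r
# Hypothetical sample data: the same kind of string, with keys in a different order per row
df <- data.frame(
  properties = c(
    "{'country': ['mx'], 'taxNumber': ['DID100927249'], 'topics': ['sanction']}",
    "{'topics': ['sanction'], 'country': ['us']}"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one string into a named list of key -> value
parse_properties <- function(x) {
  # Find every 'key': ['value'] pair in the string
  pairs <- regmatches(x, gregexpr("'[^']+': \\['[^']*'\\]", x))[[1]]
  keys <- sub("^'([^']+)'.*$", "\\1", pairs)
  vals <- sub("^'[^']+': \\['([^']*)'\\]$", "\\1", pairs)
  setNames(as.list(vals), keys)
}

rows <- lapply(df$properties, parse_properties)
all_keys <- unique(unlist(lapply(rows, names)))

# One row per input string, one column per key, NA where a key is absent
parsed <- do.call(rbind, lapply(rows, function(r) {
  vals <- vapply(
    all_keys,
    function(k) if (is.null(r[[k]])) NA_character_ else r[[k]],
    character(1)
  )
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
}))

print(parsed)
```

Since the values are looked up by key, commas inside a value (as in the company name above) do not break the split the way sep=',' does.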

Related

Fuzzy matching Countries in R

For an assignment I have to use fuzzy matching in R to merge two datasets that both have a "Country" column. The first dataset is from Kaggle (Countries dataset) and the other is from the ISO 3166 standard. I already used fuzzy matching and it worked well. I added a new column to both datasets that numbers the observations from 1 up to their respective lengths (as far as I understand, this is required for fuzzy matching). I named it "Observation number". My first dataset has 227 observations and the ISO dataset has 249.
I want to create a new dataset that includes the columns from my first dataset (I have to use this dataset specifically; it has columns like migration, literacy, etc.) and the country codes from the ISO dataset. I couldn't manage to do it. The fuzzy matching output tells me how the first dataset's observation numbers map onto the ISO dataset. For example, in the first dataset the countries are ordered Afghanistan, Albania, Algeria, ..., whereas the ISO order is Albania, Algeria, Afghanistan, so the fuzzy match output gives me 3, 1, 2, ... I understand this to mean that the 3rd observation in the ISO dataset is the 1st in the Countries dataset.
I want to create a new dataset that has all the information from the Countries dataset, ordered with respect to the order of the ISO dataset's Country column.
However, I cannot do it using
a=(Result1$matches)$observationnumber
# gives me vector a: where the i-th observation of the Country dataset can be found in the ISO dataset
countryorderedlikeISO <- countries.of.the.world[match(c(a), countries.of.the.world$observation),]
It seems to ignore the countries that are present in ISO but not in the country dataset.
What can I do? I want this new dataset to be in ISO's length, with NA values for observations that are present in ISO but not in Country.
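One possible approach is to build an index vector that has the ISO dataset's length and is NA wherever no country matched, then use it to pick rows from the Countries dataset. This is only a sketch on toy data: the data frames countries and iso, their values, and the assumption that a holds the ISO position of each Countries row (NA when there is no match) are all illustrative, not taken from the real fuzzy-matching output.

```r
# Toy stand-ins for the two datasets (values are made up)
countries <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria"),
  literacy = c(36, 86.5, 70),
  stringsAsFactors = FALSE
)
iso <- data.frame(
  Country = c("Albania", "Algeria", "Afghanistan", "Andorra"),
  code = c("AL", "DZ", "AF", "AD"),
  stringsAsFactors = FALSE
)

# Assumed shape of the match result: ISO row number for each Countries row
a <- c(3, 1, 2)

# idx[j] = Countries row that belongs to ISO row j, NA if none matched
idx <- rep(NA_integer_, nrow(iso))
matched <- !is.na(a)
idx[a[matched]] <- which(matched)

# Indexing with NA yields a row of NAs, so the result keeps ISO's length
countries_like_iso <- cbind(
  iso[, c("Country", "code")],
  countries[idx, "literacy", drop = FALSE]
)
rownames(countries_like_iso) <- NULL

print(countries_like_iso)
```

Here Andorra is present in ISO only, so its literacy value comes out as NA instead of the row being dropped.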

Convert comma separated column into multiple columns

I have a dataset of films with several columns, one of which is a column for country. Because some films are produced by more than one country, a film can have several countries at the same time in the "country" column. For example,
[image not available]
I now want to create a new dataset in which each row of the "country" column has only one country. For example, in the screenshot above, Bluebeard is produced by "France", "Germany", and "Italy". I want the dataset to show Bluebeard as produced by "France", "Germany", and "Italy" in separate rows.
I tried the strsplit() and colsplit() functions, but they don't seem to convert the comma-separated "country" column so that each row contains only one country.
Any suggestions?
Thank you!
Using tidyr:
separate_rows(data, country, sep = ", ")
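For comparison, the same reshaping can be sketched in base R with strsplit(), which the question mentions trying. This is not the tidyr call above but an equivalent written by hand; the data frame films, its title column and the second film are hypothetical stand-ins for the real data.

```r
# Hypothetical sample: one film with three countries, one with a single country
films <- data.frame(
  title = c("Bluebeard", "Example Film"),
  country = c("France, Germany, Italy", "Spain"),
  stringsAsFactors = FALSE
)

# Split each country string into its parts
parts <- strsplit(films$country, ", ", fixed = TRUE)

# Repeat every title once per country, then flatten the country lists
films_long <- data.frame(
  title = rep(films$title, lengths(parts)),
  country = unlist(parts),
  stringsAsFactors = FALSE
)

print(films_long)
```

Any other columns would need the same rep(..., lengths(parts)) treatment, which is the bookkeeping that separate_rows() does automatically.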

How to merge two tables of different row numbers with approximate common values? (using R)

I am struggling to understand how to combine in R two tables when the common variables are not exactly similar.
To give some context, I have downloaded two sources of information about politicians, one from Twitter and one from the administration, and created two different data frames. In the first data frame (dataset 1), I have the names of the politicians present on Twitter. However, I don't know whether these politicians are currently in office or not. To find that out, I could use the second data frame.
The second data frame (dataset 2) contains the names and other information about the politicians currently in office.
The first and last names are the only variables contained in both tables. The two tables do not have the same number of rows.
Problem:
The names in the first dataset are stored as one variable (first name + last name), whereas in the second dataset they are split into two variables (last name and first name). I used separate to split the name column in the first table:
parliament_twitter_tempdata <- separate(parliament_twitter_tempdata, col = name, into = c("firstname", "lastname"), extra = "merge")
However I have problems with it as both datasets have:
compound first names and compound last names
first names and last names in the wrong order
I have included a picture of a part (from lastname "J" to "M") of both datasets to illustrate the difference between the similar values or the inversion of lastname, firstname.
How could I improve my code?
The names in the two tables are not exactly the same. Some people did not write their official name on Instagram. Is there any function that could compare the two tables, find the names that match at around 80%, and replace the name in data frame 1 (from Twitter) with the official name from data frame 2? Ex. Dataset 1: Marie Gabour; Dataset 2: Marie Gabour Jolliet -> replace Marie Gabour in dataset 1 with Marie Gabour Jolliet
Could someone help me there? Many thanks !
[Image 1: part of dataset 1 after using separate (lastname from "J" to "M")]
[Image 2: part of the names in dataset 2 (lastname from "J" to "M")]
Fuzzy matching might be a way to move forward:
https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf
Also, cleaning functions may help (e.g., using toupper or removing whitespace on the key).
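As a rough illustration of both suggestions without extra packages, the sketch below cleans the names and then uses base R's adist() (edit distance) to find the closest official name. It is a stand-in for a fuzzyjoin-based solution, not that package's API; the vectors twitter and official, the helper clean and the similarity formula are assumptions made for this example.

```r
# Hypothetical sample names from the two sources
twitter  <- c("Marie Gabour", "jean  martin ")
official <- c("Marie Gabour Jolliet", "Jean Martin")

# Hypothetical cleaning helper: collapse whitespace, trim, upper-case
clean <- function(x) toupper(trimws(gsub("\\s+", " ", x)))

# Edit distance between every cleaned Twitter name and every cleaned official name
d <- adist(clean(twitter), clean(official))

# Closest official name for each Twitter name
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(best), best)]

# Similarity in [0, 1]: 1 minus distance relative to the longer of the two names
longest <- pmax(nchar(clean(twitter)), nchar(clean(official))[best])
matches <- data.frame(
  twitter = twitter,
  official = official[best],
  similarity = 1 - best_dist / longest,
  stringsAsFactors = FALSE
)

print(matches)
```

A cutoff such as similarity >= 0.8 can then be applied before replacing names. Note that with this particular measure "Marie Gabour" against "Marie Gabour Jolliet" scores 0.6, so the right threshold depends on how the similarity is defined.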

How to combine rows based off of duplicate values?

Basically what we have is several columns as follows:
Household ID, restaurantspend, groceryspend, foodtruckspend
We have duplicate household ids because each spend is in its own individual column so an example of our data looks like this:
[data example: image not available]
We want to have the Household ID only have 1 row per id and combine the numerical values of the other column.
aggdata = aggregate(mydata, by = list(mydata$HouseHoldID), FUN = sum)
I created the table above and saved it as "mydata". Run the code above and view the output "aggdata": you will see an extra column "Group.1", which holds the group defined by "HouseHoldID". You can ignore the second column "HouseHoldID": because it is summed along with the other columns, it no longer holds the ID itself, which is available in "Group.1".
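A variant of the same aggregate() call avoids the extra "Group.1" column by naming the grouping list and leaving the ID out of the summed columns. The data below is a made-up stand-in for the table in the question, and it assumes that the empty cells are NA, which is why na.rm = TRUE is passed through to sum().

```r
# Hypothetical data: each row carries one spend value, the others are NA
mydata <- data.frame(
  HouseHoldID = c(1, 1, 1, 2, 2),
  restaurantspend = c(20, NA, NA, 15, NA),
  groceryspend = c(NA, 50, NA, NA, 30),
  foodtruckspend = c(NA, NA, 5, NA, NA)
)

# Sum every spend column per household; the ID column is excluded from the sums
spend_cols <- setdiff(names(mydata), "HouseHoldID")
aggdata <- aggregate(
  mydata[spend_cols],
  by = list(HouseHoldID = mydata$HouseHoldID),
  FUN = sum,
  na.rm = TRUE
)

print(aggdata)
```

With na.rm = TRUE a household that has no value at all in a column gets 0 for it rather than NA.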

Problems creating table from data in long format

I'll describe my data:
First column are corine_values, going from 1 to 50.
Second column are bird_names, there are 70 different bird_names, each corine_value has several bird_names.
Third column contains the sex of the bird_name.
Fourth column contains a V1-value (measurement) that belongs to the category described by the first three columns.
I want to create a table where the row names are the bird_names: first all the females in alphabetical order, followed by the males in alphabetical order. The column names should be the corine_values, from small to big. The data in the table should be the corresponding V1-values.
I've been trying some things, but to be honest I'm just starting with R and I don't really have a clue how to do it. I can sort the data, but not on multiple levels (like alphabetical and sex combined). I'm exporting everything to Excel now and doing it manually, which is very time-consuming.
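The table described above can be sketched in base R with tapply(), which cross-tabulates a value by two grouping keys. This is a minimal sketch on invented data: the data frame birds, the species names and the "F"/"M" coding of sex are assumptions, and it presumes at most one V1 value per combination of corine_value, bird_name and sex.

```r
# Hypothetical long-format data
birds <- data.frame(
  corine_value = c(2, 2, 2, 10, 10, 10),
  bird_name = c("robin", "wren", "robin", "wren", "robin", "wren"),
  sex = c("F", "F", "M", "F", "M", "M"),
  V1 = c(0.5, 1.2, 0.7, 1.0, 0.9, 0.3),
  stringsAsFactors = FALSE
)

# Row key puts sex first, so sorting gives all females, then all males,
# each block in alphabetical order of bird_name
row_key <- paste(birds$sex, birds$bird_name)

# Rows: sex + bird name; columns: corine values in numeric order
wide <- tapply(birds$V1, list(row_key, birds$corine_value), sum)

print(wide)
```

Combinations that do not occur in the data show up as NA, and write.csv(wide, "table.csv") would export the result for Excel if that is still needed.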
