R Split strings into two columns based on TWO regular expressions - r

I have some data that is structured something like this:
ID Region Value
1 Europe 8
2 Europe: Class 1 6
3 Asia: System 2 6
4 North America 7
5 Europe: System 1 5
6 Africa 7
7 Africa: Class 2 5
8 South America 9
9 Europe: System 1 3
10 Europe 7
What I want to do is create a new column called Class which adds instances of where "Class" AND "System" are mentioned in the Region column - if it's not clear what I mean, take a look at my expected output below. I know this can be done with the separate function but I think you can only specify one value for the separator part of the code. E.g. sep = ": Class" will only split instances that mention "class" but I also want to split any instances where "system" is mentioned too. Can this be done in one line of code, or do I need to do something a bit more complicated here? Here's how my final data should look:
ID Region Class Value
1 Europe 8
2 Europe 1 6
3 Asia 2 6
4 North America 7
5 Europe 1 5
6 Africa 7
7 Africa 2 5
8 South America 9
9 Europe 1 3
10 Europe 7
Please note, I want to remove any reference to "class" or "system" (including colons) from the Region column, and simply add the numerical value to a new Class column.

You can do it with base functions by just using strsplit with a regular expression that takes either ": System" or ": Class" as symbol:
splitted = strsplit(df$Region,"(: Class)|(: System)")
df$Region = lapply(splitted,FUN=function(x){x[1]})
df$Class = lapply(splitted,FUN=function(x){x[2]})
The result is:
> df
ID Region Value Class
1 1 Europe 8 NA
2 2 Europe 6 1
3 3 Asia 6 2
4 4 North America 7 NA
5 5 Europe 5 1
6 6 Africa 7 NA
7 7 Africa 5 2
8 8 South America 9 NA
9 9 Europe 3 1
10 10 Europe 7 NA

You can use str_extract to extract the number and str_remove to drop the text that you don't want.
library(dplyr)
library(stringr)
df %>%
mutate(Class = str_extract(Region, '(?<=(Class|System)\\s)\\d+'),
Region = str_remove(Region, ':\\s*(Class|System)\\s*\\d+'))
# ID Region Value Class
#1 1 Europe 8 <NA>
#2 2 Europe 6 1
#3 3 Asia 6 2
#4 4 North America 7 <NA>
#5 5 Europe 5 1
#6 6 Africa 7 <NA>
#7 7 Africa 5 2
#8 8 South America 9 <NA>
#9 9 Europe 3 1
#10 10 Europe 7 <NA>
str_extract extracts the number which comes after 'Class'
or 'System'. If these words are not present then it returns NA.
str_remove removes colon followed by zero or more whitespace (\\s*) followed by either 'Class' or 'System' and a number (\\d+).
data
It is easier to help if you provide data in a reproducible format which is easier to copy.
df <- structure(list(ID = 1:10, Region = c("Europe", "Europe: Class 1",
"Asia: System 2", "North America", "Europe: System 1", "Africa",
"Africa: Class 2", "South America", "Europe: System 1", "Europe"
), Value = c(8L, 6L, 6L, 7L, 5L, 7L, 5L, 9L, 3L, 7L)),
class = "data.frame", row.names = c(NA, -10L))

Related

merging financial time series data

Given data frames the first column of which is the list of country names and is common in all data frames and the remainder columns are the years for which the value of the indicator is measured and these being the years is also common in data frames, what are the ways to merge the datasets by the first column? How to merge into a multidimensional array? dataset example:
country name
2005
....
2020
Aruba
23591
Angola
1902
country name
2005
....
2020
Aruba
-8.8
Angola
-3.5
Doing a full_join
library(dplyr)
full_join(DataSet1,DataSet2, by = 'country name')
changes the name of the columns and the data is not accessible.
1) Assuming the data frames in the Note at the end we can use bind_rows
library(dplyr)
bind_rows(DF1, DF2, .id = "id")
giving the following which takes all the rows from both data frames and identifies which data frame each row came from.
id countryName 2005 2006
1 1 Aruba 1 2
2 1 Angola 3 4
3 2 Aruba 11 12
4 2 Angola 13 14
2) Another possibility is to create a 3d array
library(abind)
a <- abind(DF1[-1], DF2[-1], along = 3, new.names = list(DF1$countryName,NULL,1:2))
a
giving this 3d array where the dimensions correspond to the country name, the year and the originating data.frame.
, , 1
2005 2006
Aruba 1 2
Angola 3 4
, , 2
2005 2006
Aruba 11 12
Angola 13 14
We can get various slices:
> a["Aruba",,]
1 2
2005 1 11
2006 2 12
> a[,"2005",]
1 2
Aruba 1 11
Angola 3 13
> a[,,2]
2005 2006
Aruba 11 12
Angola 13 14
Note
DF1 <- structure(list(countryName = c("Aruba", "Angola"), `2005` = c(1L,
3L), `2006` = c(2L, 4L)), class = "data.frame", row.names = c(NA, -2L))
DF2 <- structure(list(countryName = c("Aruba", "Angola"), `2005` = c(11L,
13L), `2006` = c(12L, 14L)), class = "data.frame", row.names = c(NA, -2L))
> DF1
countryName 2005 2006
1 Aruba 1 2
2 Angola 3 4
> DF2
countryName 2005 2006
1 Aruba 11 12
2 Angola 13 14

How to create a data group (factor variables) in my dataframe based on categorical variables #R

I want to create a factor variables in my dataframes based on categorical variables.
My data:
# A tibble: 159 x 3
name.country gpd rate_suicide
<chr> <dbl> <dbl>
1 Afghanistan 2129. 6.4
2 Albania 12003. 5.6
3 Algeria 11624. 3.3
4 Angola 7103. 8.9
5 Antigua and Barbuda 19919. 0.5
6 Argentina 20308. 9.1
7 Armenia 10704. 5.7
8 Australia 47350. 11.7
9 Austria 52633. 11.4
10 Azerbaijan 14371. 2.6
# ... with 149 more rows
I want to create factor variable region, which contains a factors as:
region <- c('Asian', 'Europe', 'South America', 'North America', 'Africa')
region = factor(region, levels = c('Asian', 'Europe', 'South America', 'North America', 'Africa'))
I want to do this with dplyr packages, that can to choose a factor levels depends on name.countrybut it doesn't work. Example:
if (new_data$name.country[new_data$name.country == "N"]) {
mutate(new_data, region_ = region[1])
}
How i can solve the problem?
I think the way I would think about your problem is
Create a reproducible problem. (see How to make a great R reproducible example. ) Since you already have the data, use dput to make it easier for people like me to recreate your data in their environment.
dput(yourdf)
structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
raw_data<-structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
Define vectors that specify your regions
Use case_when to separate countries into regions
Use as.factor to convert your character variable to a factor
asia=c("Afghanistan","India","...","Rest of countries in Asia")
europe=c("Albania","France","...","Rest of countries in Europe")
africa=c("Algeria","Egypt","...","Rest of countries in Africa")
df<-raw_data %>%
mutate(region=case_when(
name.country %in% asia ~ "asia",
name.country %in% europe ~ "europe",
name.country %in% africa ~ "africa",
TRUE ~ "other"
)) %>%
mutate(region=region %>% as.factor())
You can check that your variable region is a factor using str
str(df)
'data.frame': 3 obs. of 4 variables:
$ name.country: chr "Afghanistan" "Albania" "Algeria"
$ gpd : int 2129 12003 11624
$ rate_suicide: num 6.4 5.6 3.3
$ region : Factor w/ 3 levels "africa","asia",..: 2 3 1
Here is a working example that combines data from the question with a file of countries and region information from Github. H/T to Luke Duncalfe for maintaining the region data, which is:
...a combination of the Wikipedia ISO-3166 article for alpha and numeric country codes and the UN Statistics site for countries' regional and sub-regional codes.
regionFile <- "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
regionData <- read.csv(regionFile,header=TRUE)
textFile <- "rowID|country|gdp|suicideRate
1|Afghanistan|2129.|6.4
2|Albania|12003.|5.6
3|Algeria|11624.|3.3
4|Angola|7103.|8.9
5|Antigua and Barbuda|19919.|0.5
6|Argentina|20308.|9.1
7|Armenia|10704.|5.7
8|Australia|47350.|11.7
9|Austria|52633.|11.4
10|Azerbaijan|14371.|2.6"
data <- read.csv(text=textFile,sep="|")
library(dplyr)
data %>%
left_join(.,regionData,by = c("country" = "name"))
...and the output:
rowID country gdp suicideRate alpha.2 alpha.3 country.code
1 1 Afghanistan 2129 6.4 AF AFG 4
2 2 Albania 12003 5.6 AL ALB 8
3 3 Algeria 11624 3.3 DZ DZA 12
4 4 Angola 7103 8.9 AO AGO 24
5 5 Antigua and Barbuda 19919 0.5 AG ATG 28
6 6 Argentina 20308 9.1 AR ARG 32
7 7 Armenia 10704 5.7 AM ARM 51
8 8 Australia 47350 11.7 AU AUS 36
9 9 Austria 52633 11.4 AT AUT 40
10 10 Azerbaijan 14371 2.6 AZ AZE 31
iso_3166.2 region sub.region intermediate.region
1 ISO 3166-2:AF Asia Southern Asia
2 ISO 3166-2:AL Europe Southern Europe
3 ISO 3166-2:DZ Africa Northern Africa
4 ISO 3166-2:AO Africa Sub-Saharan Africa Middle Africa
5 ISO 3166-2:AG Americas Latin America and the Caribbean Caribbean
6 ISO 3166-2:AR Americas Latin America and the Caribbean South America
7 ISO 3166-2:AM Asia Western Asia
8 ISO 3166-2:AU Oceania Australia and New Zealand
9 ISO 3166-2:AT Europe Western Europe
10 ISO 3166-2:AZ Asia Western Asia
region.code sub.region.code intermediate.region.code
1 142 34 NA
2 150 39 NA
3 2 15 NA
4 2 202 17
5 19 419 29
6 19 419 5
7 142 145 NA
8 9 53 NA
9 150 155 NA
10 142 145 NA
At this point one can decide whether to use the region, sub region, or intermediate region and convert it to a factor.
We can set region to a factor by adding a mutate() function to the dplyr pipeline:
data %>%
left_join(.,regionData,by = c("country" = "name")) %>%
mutate(region = factor(region)) -> mergedData
At this point mergedData$region is a factor.
str(mergedData$region)
table(mergedData$region)
> str(mergedData$region)
Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 2 3 5 4 3
> table(mergedData$region)
Africa Americas Asia Europe Oceania
2 2 3 2 1
Now the data is ready for further analysis. We will generate a table of average suicide rates by region.
library(knitr) # for kable
mergedData %>% group_by(region) %>%
summarise(suicideRate = mean(suicideRate)) %>%
kable(.)
...and the output:
|region | suicideRate|
|:--------|-----------:|
|Africa | 6.1|
|Americas | 4.8|
|Asia | 4.9|
|Europe | 8.5|
|Oceania | 11.7|
When rendered in an HTML / markdown viewer, the result looks like this:

How do you match a numeric value to a categorical value in another data set

I have two data sets. One with a numeric value assigned to individual categorical variables (country name) and a second with survey responses including a person's nationality. How do I assign the numeric value to a new column in the survey dataset with matching nationality/country name?
Here is the head of data set 1 (my.data1):
EN HCI
1 South Korea 0.845
2 UK 0.781
3 USA 0.762
Here is the head of data set 2 (my.data2):
Nationality OIS IR
1 South Korea 2 2
2 South Korea 3 3
3 USA 3 4
4 UK 3 3
I would like to make it look like this:
Nationality OIS IR HCI
1 South Korea 2 2 0.845
2 South Korea 3 3 0.845
3 USA 3 4 0.762
4 UK 3 3 0.781
I have tried this but unsuccessfully:
my.data2$HCI <- NA
for (i in i:nrow(my.data2)) {
my.data2$HCI[i] <- my.data1$HCI[my.data1$EN == my.data2$Nationality[i]]
}
We can use a left_join
library(dplyr)
left_join(my.data2, my.data1, by = c("Nationality" = "EN"))
Or with merge from base R
merge(my.data2, my.data1, by.x = c("Nationality", by.y = "EN", all.x = TRUE)

How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function?

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo NA 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India NA 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa NA 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa NA 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa NA 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States NA 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
And after filling the grade column I have something like this (fish19)
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo D 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India A 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa C 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa A 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa E 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States B 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
Both dataframes have many specimens belonging to the same species of fish, and the thing is that the grades are suposed to be assigned to each species for every specimen of that species. The problem I'm having is that some rows belonging to the same species are having different grades, specially in the case of the grades "C" and "E". What I want to incorporate into my ifelse function is: Change from grade "C" to "E" every occurrence of the dataframe where two or more specimens belonging to the same species are assigned "C" in one row and "E" in another row. Because if one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% function and just using "=="
Trying with %in%
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]%in%fish19$species[fish19$grade=="C"]==TRUE,"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]==fish19$species[fish19$grade=="C"],"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Both these two options did not work and the output of this alteration should be that if one occurrence of a specific species name has the grade "E" assigned to it, so should all other occurences with that same species name.
I'm sorry if this was confusion but I tried to be as clear as I could, thank you in advance for any responses.
Kind of a long winded answer, but:
dat = data.frame('species'=c('a','b','c','a','a','b'),'grade'=c('E','E','C','C','C','D'))
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade=='E')),by='species')
Then you could do an ifelse for sum_e>0

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.

Resources