How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function? - r

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo NA 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India NA 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa NA 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa NA 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa NA 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States NA 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
And after filling the grade column I have something like this (fish19)
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo D 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India A 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa C 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa A 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa E 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States B 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
Both dataframes have many specimens belonging to the same species of fish, and the thing is that the grades are suposed to be assigned to each species for every specimen of that species. The problem I'm having is that some rows belonging to the same species are having different grades, specially in the case of the grades "C" and "E". What I want to incorporate into my ifelse function is: Change from grade "C" to "E" every occurrence of the dataframe where two or more specimens belonging to the same species are assigned "C" in one row and "E" in another row. Because if one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% function and just using "=="
Trying with %in%
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]%in%fish19$species[fish19$grade=="C"]==TRUE,"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]==fish19$species[fish19$grade=="C"],"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Both these two options did not work and the output of this alteration should be that if one occurrence of a specific species name has the grade "E" assigned to it, so should all other occurences with that same species name.
I'm sorry if this was confusion but I tried to be as clear as I could, thank you in advance for any responses.

Kind of a long winded answer, but:
dat = data.frame('species'=c('a','b','c','a','a','b'),'grade'=c('E','E','C','C','C','D'))
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade=='E')),by='species')
Then you could do an ifelse for sum_e>0

Related

How do you match a numeric value to a categorical value in another data set

I have two data sets. One with a numeric value assigned to individual categorical variables (country name) and a second with survey responses including a person's nationality. How do I assign the numeric value to a new column in the survey dataset with matching nationality/country name?
Here is the head of data set 1 (my.data1):
EN HCI
1 South Korea 0.845
2 UK 0.781
3 USA 0.762
Here is the head of data set 2 (my.data2):
Nationality OIS IR
1 South Korea 2 2
2 South Korea 3 3
3 USA 3 4
4 UK 3 3
I would like to make it look like this:
Nationality OIS IR HCI
1 South Korea 2 2 0.845
2 South Korea 3 3 0.845
3 USA 3 4 0.762
4 UK 3 3 0.781
I have tried this but unsuccessfully:
my.data2$HCI <- NA
for (i in i:nrow(my.data2)) {
my.data2$HCI[i] <- my.data1$HCI[my.data1$EN == my.data2$Nationality[i]]
}
We can use a left_join
library(dplyr)
left_join(my.data2, my.data1, by = c("Nationality" = "EN"))
Or with merge from base R
merge(my.data2, my.data1, by.x = c("Nationality", by.y = "EN", all.x = TRUE)

cbind arguments in large dataframe

I have searched unsuccessfully for several days for an answer to this question: I have a dataframe with 279 columns and want to generate subtotals using aggregate(), or indeed, anything suitable. Here is a subset:
LGA off.cat sub.cat Jan1995 Feb1995
1 Albury Homicide Murder * 0 0
2 Albury Homicide Attempted murder 0 0
3 Albury Homicide Murder accessory, conspiracy 0 0
4 Albury Homicide Manslaughter * 0 0
5 Albury Assault Domestic violence related assault 7 7
6 Albury Assault Non-domestic violence related assault 29 20
7 Albury Assault Assault Police 12 3
8 Albury Sexual offences Sexual assault 4 3
The full dataframe contains dozens of LGA values, and many more date columns. I would like to obtain subtotals for each unique LGA value grouped by unique values of off.cat and sub.cat, summed over all dates. I tried using cbind in aggregate, but found no way to generate the 276 date column names that would not cause errors. Explicit column names worked fine. Apologies for the lack of clarity in the earlier post, and thanks to those who valiantly tried to interpret my meaning.
Your question is a bit unclear, but you may be successful using the formula syntax of aggregate. Here's an example:
df <- data.frame(group = letters[1:5],
x = 1:5,
y = 6:10,
z = 11:15)
group x y z
1 a 1 6 11
2 b 2 7 12
3 c 3 8 13
4 d 4 9 14
5 e 5 10 15
We now sum all three variables x, y and z by the levels of group, using setdiff to get a vector of column names except group, and pasting them together to use in as.formula:
aggregate(as.formula(paste(paste(setdiff(names(df), c("group")), collapse = "+"), "~ group")), data = df, sum)
group x + y + z
1 a 18
2 b 21
3 c 24
4 d 27
5 e 30
Hope this helps.

How to express a variable as a function of 2 others in a dataframe composed of 3 vectors

I know it is fundamental but I can't find the trick ...
Here is an exemple :
Species <- c("dark frog",rep(c("elephant","tiger","boa"),3),"black mamba")
Year <- c(rep(2011,4),rep(2012,3),rep(2013,4))
Abundance <- c(2,4,5,6,9,2,1,5,6,8,4)
df <- data.frame(Species, Year, Abundance)
I would like to obtain another dataframe (3 rows *5 columns) with the abundance values in function of the species as the column names (each species appearing thus only one time) and the years as the row names (appearing one time also).
May someone help me please ?
You mean something like this?
> xtabs(Abundance~Year+Species, data=df)
Species
Year black mamba boa dark frog elephant tiger
2011 0 6 2 4 5
2012 0 1 0 9 2
2013 4 8 0 5 6
The class for the above is a table, so if you prefer a data.frame instead, you can try:
library(tidyr)
new.df<- spread(df, key = Species, value = Abundance)
Year black mamba boa dark frog elephant tiger
1 2011 NA 6 2 4 5
2 2012 NA 1 NA 9 2
3 2013 4 8 NA 5 6
If you want 0s instead of NA add the following line:
new.df[is.na(new.df)]<- 0

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.

Deduplicate dataframe based on criteria in R?

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see the value "Jane" appears 3 times. What I want to do is to deduplicate the list based on the variable "Name" but because the rest of the columns are important to me, I want to keep the rows that have the most information in them. For example if I was to deduplicate the above file in excel, it would keep the first value of "Jane" and delete all the other ones. But the first value of "Jane" (row no3) has got missing information in the other columns.
So in other words I want to deduplicate the list by "Name" but add a criteria to keep the rows that have any other value different from "0" in the column "Age". This way the result I would get would be this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But like excel it keeps the value of "Jane" that has no usable information in the other columns.
How do I sort the rows based on column "Age" in a Z-A order so that anything that has "0" will be on the bottom and will be removed when I deduplicate the list?
Cheers
David
Try this trick
ind <- with(DF,
Country !=0 &
Gender %in% c('F', 'M') &
Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
So far it works well and produces your desired output
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort decreasingly by Age and Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select by indexing the correct rows meeting your conditions, I really think the first part is better than these two lasts.
If all duplicated rows have the value zero in column Age, it will work with subset:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18

Resources