How to replace certain values in a dataset in R

I'm creating a new column in my data set that is a copy of a preexisting column but I need to change 2 of the 7 values to something else.
I have tried doing this
Dataset$category_alt = Dataset$category_title
Dataset[Dataset$category_alt == "Shows"] <- "Other"
Dataset[Dataset$category_alt == "Nonprofits & Activism"] <- "Other"
But I receive this error
Error in [<-.data.frame(*tmp*, USvideos$category_alt == "Shows", value = "Other") :
duplicate subscripts for columns

You're missing the new column name in the assignment.
Dataset$category_alt = Dataset$category_title
Dataset$category_alt[Dataset$category_alt == "Shows"] <- "Other"
Dataset$category_alt[Dataset$category_alt == "Nonprofits & Activism"] <- "Other"
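To see why indexing the column matters, here is a minimal sketch on a made-up data frame (the name videos and its values are hypothetical stand-ins for Dataset). Assigning into Dataset[cond] with a single logical vector treats the vector as a column selector, which is what triggers the "duplicate subscripts for columns" error; indexing the column itself targets individual values.

```r
# Hypothetical toy data standing in for the asker's Dataset
videos <- data.frame(
  category_title = c("Music", "Shows", "Gaming", "Nonprofits & Activism"),
  stringsAsFactors = FALSE  # keep text as character so "Other" can be assigned freely
)

# Copy the column, then index the new column (not the data frame) when replacing
videos$category_alt <- videos$category_title
videos$category_alt[videos$category_alt == "Shows"] <- "Other"
videos$category_alt[videos$category_alt == "Nonprofits & Activism"] <- "Other"

print(videos)
```

If the column is a factor rather than character, assigning a value that is not an existing level produces NA with a warning, so converting the copy with as.character() first may be needed.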

We can also use %in% to make the code clearer and shorter
Dataset$category_alt[Dataset$category_alt %in% c('Shows', 'Nonprofits & Activism')] <-'Other'
I usually prefer to use dplyr in such situations:
library(dplyr)
# assign the result back, otherwise the modified data frame is only printed
Dataset <- Dataset %>%
  mutate(category_alt = replace(category_alt,
                                category_alt %in% c('Shows', 'Nonprofits & Activism'),
                                'Other'))

Related

Quickly lookup to another data frame using two column values in R

I have a data frame (call it 'ModelOutput') with three columns (Trial, DurationRet, DiscountRate) and another (call it 'drdata') with three columns (Scenario, variable, value).
I want to quickly filter drdata$Scenario == ModelOutput$Trial & drdata$variable == ModelOutput$DurationRet to return drdata$value into the ModelOutput$DiscountRate column. Is there a way to do this efficiently?
Here are my two attempts, the first of which fails and the second of which is entirely too slow.
ModelOutput$Trial <- drdata[drdata$Scenario == ModelOutput$Trial & drdata$variable == ModelOutput$DurationRet,"value"]
foreach(row = 1:nrow(ModelOutput)) %do%{
ModelOutput[row, "DiscountRate"] <- drdata[drdata$Scenario == ModelOutput[row, "Trial"] & drdata$variable == as.factor(ModelOutput[row,"DurationRet"]+1),"value"]
}
It took me a minute, but I realized joins could do the job I was looking for.
Here is my final code:
ModelOutput <- ModelOutput %>% full_join(drdata, by = c(Trial = "Scenario", DurationRet = "variable"))
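As a rough base R sketch of the same lookup, assuming each (Scenario, variable) pair occurs at most once in drdata and that every row of ModelOutput should be kept, the two columns can be combined into one key and matched. The values below are made up; only the column names come from the question.

```r
# Made-up data using the column names from the question
ModelOutput <- data.frame(Trial = c(1, 1, 2, 3),
                          DurationRet = c(10, 20, 10, 30))
drdata <- data.frame(Scenario = c(1, 1, 2),
                     variable = c(10, 20, 10),
                     value = c(0.03, 0.04, 0.05))

# Build one key per row from both lookup columns, then find each ModelOutput key in drdata
key_out <- paste(ModelOutput$Trial, ModelOutput$DurationRet, sep = "_")
key_dr <- paste(drdata$Scenario, drdata$variable, sep = "_")
ModelOutput$DiscountRate <- drdata$value[match(key_out, key_dr)]

print(ModelOutput)  # rows without a match get NA in DiscountRate
```

With dplyr, left_join() with the same by argument keeps exactly the rows of ModelOutput, whereas full_join() also keeps rows of drdata that have no match.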

IF statement in R. Add new elements in new column if conditions met

I'm trying to add a column in the data frame where the new element in the new column has the value of "1" if the conditions are met for that particular row.
To check the condition I am iterating through another reference data frame.
county_list = (df$county_name[df$wolves_present_in_county==1 & df$year==2015])
for (i in df$county_name) {
for (j in county_list) {
if (df$county_name[i]==county_list[j])
{
df$wolvein2015 = 1
break
}
}
}
I think you can do what you want in base R. Here is an example using mtcars:
cars <- mtcars
cars$new <- ifelse(cars$cyl == 4 & cars$mpg > 30, 1, 0)
The new column is added with 0/1's based on conditions of 2 other variables.
BTW, because R is a vectorized language, you should only use for loops as a last resort.
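Applied to the question's data, a vectorized sketch might look like the following. It assumes the goal is to flag every row whose county had wolves in 2015; the sample values are invented and only the column names come from the question.

```r
# Invented sample data with the question's column names
df <- data.frame(
  county_name = c("Adams", "Adams", "Baker", "Baker", "Clark"),
  year = c(2014, 2015, 2014, 2015, 2015),
  wolves_present_in_county = c(0, 1, 1, 0, 0)
)

# Counties that had wolves in 2015
county_list <- df$county_name[df$wolves_present_in_county == 1 & df$year == 2015]

# %in% tests every row at once, replacing the nested for loops
df$wolvein2015 <- as.integer(df$county_name %in% county_list)

print(df)
```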
Here is an option with dplyr
library(dplyr)
df %>%
mutate(wolvein2015 = +((year == 2015 & !is.na(year)) &
as.logical(wolves_present_in_county) & !is.na(wolves_present_in_county)))

R - Flexible conditions

I have the following R statement. Basically it goes through the entire matchesData data frame and checks whether the conditions are met for each row.
If it matches, put a '1' at matchesData$isRedPreferredLineup.
matchesData$isRedPreferredLineup <- ifelse((matchesData$redTop==red_poplist[1] &
matchesData$redADC==red_poplist[2] &
matchesData$redJungle==red_poplist[3] &
matchesData$redSupport==red_poplist[4] &
matchesData$redMiddle==red_poplist[5] &
matchesData$YearSeason==Season), 1,
matchesData$isRedPreferredLineup)
However, now I need the condition to be flexible. Meaning, if
matchesData$redTop==red_poplist[1]
matchesData$redADC==red_poplist[2]
matchesData$redJungle==red_poplist[3]
conditions are matched, or if
matchesData$redJungle==red_poplist[3]
matchesData$redSupport==red_poplist[4]
matchesData$redMiddle==red_poplist[5]
conditions are matched, or any other permutation comprising 3 or more of the following conditions are matched, I would like to put '1' at matchesData$isRedPreferredLineup.
(matchesData$redTop==red_poplist[1] &
matchesData$redADC==red_poplist[2] &
matchesData$redJungle==red_poplist[3] &
matchesData$redSupport==red_poplist[4] &
matchesData$redMiddle==red_poplist[5] &
matchesData$YearSeason==Season)
How can I do so in a vectorized ifelse statement like this?
Or is there a better way to do this?
Please bear with me, I am pretty new to R. Thanks.
Maybe this could work (a row is selected when at least 3 of the conditions hold):
selectIndex <- apply(matchesData, 1, function(row) {
  # count how many of the six conditions this row satisfies, then compare with the threshold
  sum(c(row['redTop'] == red_poplist[1],
        row['redADC'] == red_poplist[2],
        row['redJungle'] == red_poplist[3],
        row['redSupport'] == red_poplist[4],
        row['redMiddle'] == red_poplist[5],
        row['YearSeason'] == Season)) >= 3
})
matchesData$isRedPreferredLineup[selectIndex] <- 1
You could vectorise the TRUE/FALSE statements like this:
my.conditions <- cbind(matchesData$redTop==red_poplist[1], matchesData$redADC==red_poplist[2],
matchesData$redJungle==red_poplist[3], matchesData$redSupport==red_poplist[4],
matchesData$redMiddle==red_poplist[5], matchesData$YearSeason==Season)
Then S1 <- rowSums(my.conditions) gives the number of TRUEs in each row of my.conditions, so your final condition boils down to ifelse(S1 > 2, 1, ...). Equivalently:
matchesData$isRedPreferredLineup[which(S1 > 2)] <- 1
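Put together, the rowSums approach can be sketched end to end as follows. The column names follow the question, but the rows, red_poplist and Season values are placeholders invented for illustration.

```r
# Made-up match data; column names follow the question, values are placeholders
matchesData <- data.frame(
  redTop = c("A", "A", "X"),
  redADC = c("B", "B", "X"),
  redJungle = c("C", "X", "X"),
  redSupport = c("D", "X", "D"),
  redMiddle = c("E", "X", "E"),
  YearSeason = c("S5", "S5", "S4"),
  isRedPreferredLineup = 0
)
red_poplist <- c("A", "B", "C", "D", "E")
Season <- "S5"

# One logical column per condition
my.conditions <- cbind(
  matchesData$redTop == red_poplist[1],
  matchesData$redADC == red_poplist[2],
  matchesData$redJungle == red_poplist[3],
  matchesData$redSupport == red_poplist[4],
  matchesData$redMiddle == red_poplist[5],
  matchesData$YearSeason == Season
)

# Number of conditions met in each row
S1 <- rowSums(my.conditions)

# Flag rows meeting at least 3 conditions; other rows keep their existing value
matchesData$isRedPreferredLineup[S1 >= 3] <- 1

print(matchesData$isRedPreferredLineup)
```

If the columns can contain NA, rowSums(my.conditions, na.rm = TRUE) counts only the conditions known to be met.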

Can I use a function as a return argument in ifelse in "R"

I'm currently researching a matching-to-sample task in monkeys. I want to evaluate how often a certain stimulus was chosen, regardless of correctness of the choice.
To do so, I have a dataframe df with 6288 rows and 6 columns ("Monkey", "Session", "Sample", "Match", "Foil", "Success"), of which only the last three are important now.
The data in df$Match and df$Foil are the names of the stimuli (strings) and df$Success is binary. df$Match and df$Foil are made up of 65 distinct stimulus names, which I included in a vector Match.Foil.
Now I want to count how often a picture (part of the vector Match.Foil) is clicked in all 6288 trials. That is, every time the name is either in df$Match with df$Success == "1" OR in df$Foil with df$Success == "0".
I tried to build a vector with the number of times clicked for each part of Match.Foil like this:
Pic.clicked= vector(mode="numeric", length= length(Match.Foil))
for (i in 1:length(Match.Foil)){
Pic.clicked[i] = ifelse(
(df$Match == Match.Foil[i] & df$Success == "1")|
(df$Foil== Match.Foil[i] & df$Success == "0"),
Pic.clicked[i] +1,
Pic.clicked[i] +0)
}
So, as you see I wanted to use the functions Pic.clicked + 1 and Pic.clicked + 0 as the returns if the statement is TRUE or FALSE. It does not work and gives me the error:
In Pic.clicked[i] = ifelse((df$Match == Match.Foil[i] & ... :
number of items to replace is not a multiple of replacement length
Does anybody have an idea, how to build an appropriate counter? I thought about using switch, but I don't have any experience with that function and it seems not to work like I need it. I also tried running it for 6288 loops, but that produces the same warning.
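You can use sum(): on a logical vector, each TRUE counts as 1 and each FALSE as 0, so summing the condition gives the number of matching trials:
for (i in seq_along(Match.Foil)) {
  # Stage4.pics is the data frame called df in the question
  Pic.clicked[i] <- sum((Stage4.pics$Match == Match.Foil[i] & Stage4.pics$Success == "1") |
                        (Stage4.pics$Foil == Match.Foil[i] & Stage4.pics$Success == "0"))
}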
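A loop-free variant is also possible. The sketch below assumes Success only ever takes the values "1" and "0", so the clicked picture is the match on a success and the foil otherwise; the trial data are invented and only the column names come from the question.

```r
# Invented trials; column names follow the question
df <- data.frame(
  Match = c("pic1", "pic2", "pic1", "pic3"),
  Foil = c("pic2", "pic3", "pic3", "pic1"),
  Success = c("1", "0", "0", "1"),
  stringsAsFactors = FALSE
)
Match.Foil <- c("pic1", "pic2", "pic3")

# The clicked picture is the match on a success and the foil otherwise
clicked <- ifelse(df$Success == "1", df$Match, df$Foil)

# Count clicks per stimulus; factor() keeps never-clicked stimuli at 0
Pic.clicked <- table(factor(clicked, levels = Match.Foil))
print(Pic.clicked)
```

The result is a named table, so Pic.clicked["pic3"] looks up the count for a single stimulus.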

Count number of rows matching a criteria

I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
so essentially I would be counting number of rows matching my where condition.
I have imported a csv file into mydata as a data frame. So far I have tried these to no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
mydata$sCode == "CA" will return a logical vector, with a TRUE value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
1. sum(mydata$sCode == "CA"), as suggested in the comments; because TRUE is interpreted as 1 and FALSE as 0, this returns the number of TRUE values in your vector.
2. length(which(mydata$sCode == "CA")); the which() function returns a vector of the indices where the condition is met, the length of which is the count of "CA".
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which() returns a vector identifying each position where the condition is met (in this case, elements 1 and 2 of mydata$sCode, i.e. rows 1 and 2 of the data frame). The length() of this vector is the number of occurrences.
sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
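These approaches agree when sCode has no missing values, but they differ once an NA appears. The following sketch, on a made-up two-column data frame, shows how each one behaves in that case.

```r
# Toy data with one missing state code (values are made up)
dat <- data.frame(sCode = c("CA", "CA", "AC", NA),
                  pop = c(10, 20, 30, 40))

is_ca <- dat$sCode == "CA"                    # TRUE TRUE FALSE NA

n_sum <- sum(is_ca, na.rm = TRUE)             # NA comparisons are dropped before summing
n_which <- length(which(is_ca))               # which() ignores NA
n_subset <- nrow(subset(dat, sCode == "CA"))  # subset() treats NA as FALSE
n_bracket <- nrow(dat[is_ca, ])               # an NA index adds an all-NA row, inflating the count

print(c(sum = n_sum, which = n_which, subset = n_subset, bracket = n_bracket))
```

Without na.rm = TRUE, sum(is_ca) would return NA here, so the sum() and which() forms are the safer choices when missing values are possible.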
mydata$sCode == "CA" is a vector, and a vector has no dimensions; that's why nrow returns NULL.
mydata[mydata$sCode == 'CA',] returns a data.frame of the rows where sCode == 'CA'. sCode holds character values, which is why sum gives you the error.
In subset(mydata, sCode='CA', select=c(sCode)), you should use sCode=='CA' instead of sCode='CA'. subset then returns a one-column data.frame of the rows where sCode equals CA (it already treats NA as not matching), so count its rows:
nrow(subset(mydata, sCode == 'CA', select = c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
With the dplyr package, use
nrow(filter(mydata, sCode == "CA"))
All the other solutions provided here gave me the same error as multi-sam, but this one worked.
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
To get the number of observations, counting the rows of the filtered data frame would be more valid:
nrow(dat[dat$sCode == "CA",])
The grep command can also be used (note that it matches any sCode containing "CA"; use "^CA$" for an exact match):
CA = mydata[grep("CA", mydata$sCode), ]
nrow(CA)
Call nrow passing as argument the name of the dataset:
nrow(dataset)
I'm using this short function, built on dplyr, to make this easier:
library(dplyr)
countc <- function(.data, ..., preserve = FALSE) {
  # count the rows that filter() keeps
  return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter. For example:
countc(data, active == TRUE)
[1] 42
