How to use an if-statement in an apply function? - r

Since I have to read over 3 GB of data, I would like to improve my code by replacing the two for-loops and the if-statement with the apply function.
Below is a reproducible example of my code. The overall purpose (in this example) is to count the number of positive and negative values in the "c" column for each value of x and y. In my real case I have over 150 files to read.
# Example of initial data set
df1 <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),c=rnorm(15))
# Another dataframe to keep track of "c" counts
dfOcc <- data.frame(a=rep(c(1:5),times=3), b=rep(c(1:3),each=5), "positive"=c(0), "negative"=c(0))
So far I did this code, which works but is really slow:
for (i in 1:nrow(df1)) {
  x <- df1[i, "a"]
  y <- df1[i, "b"]
  if (df1[i, "c"] >= 0) {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] <- dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] + 1
  } else {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] <- dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] + 1
  }
}
I am unsure whether the code is slow due to the size of the files (260k rows each) or due to the for-loop.
So far I managed to improve it in this way:
dfOcc[which(dfOcc$a==df1$a & dfOcc$b==df1$b),"positive"] <- apply(df1,1,function(x){ifelse(x["c"]>0,1,0)})
This works fine in this example but not in my real case:
It only keeps count of the positive "c" values, and running the code a second time for the negatives might be counterproductive.
My original datasets are 260k rows while my "tracer" is 10k rows (the initial dataset repeats the a and b values with other c values).
Any tip on how to improve those two points would be greatly appreciated!

I think you can simply count and spread the data. This will be easier and will work on any group and dataset. You can change group_by(a) to group_by(a, b) if you want to count grouping by both the a and b columns (see the sketch after the code).
library(dplyr)
library(tidyr)
df1 %>%
  group_by(a) %>%
  mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%
  count(sign) %>%
  spread(sign, n)
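If you want the counts per (a, b) pair in the same shape as dfOcc, a minimal sketch of that variant (fill = 0 covers pairs where one sign never occurs):
library(dplyr)
library(tidyr)
df1 %>%
  group_by(a, b) %>%
  mutate(sign = ifelse(c > 0, "positive", "negative")) %>%
  count(sign) %>%
  spread(sign, n, fill = 0)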

The data.table package might help you do this in one line.
library(data.table)
df1 <- data.table(data.frame(a=rep(c(1:5),times=3), b=rep(c(1:3),each=5), c=rnorm(15)))
posneg <- c("positive", "negative")  # list of columns needed
df1[, (posneg) := list(ifelse(c>0, 1, 0), ifelse(c<0, 1, 0))]  # use list to combine the 2 ifelse conditions
For more information, try
?data.table
If you really want the positive/negative counts to be in a separate dataframe:
dfOcc <- df1[,c("a", "positive","negative")]
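If you would rather have the counts aggregated per (a, b) pair (like the original dfOcc) instead of flagged per row, a sketch using data.table grouping; note it treats c == 0 as neither sign, matching the ifelse conditions above:
library(data.table)
# one row per (a, b) pair with the two counts
dfOcc <- df1[, .(positive = sum(c > 0), negative = sum(c < 0)), by = .(a, b)]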

Related

Replace only certain values in column based on multiple conditions

I have a large dataframe that contains many columns, but the relevant ones are: ID (a number assigned to each subject), Time (the time at which that subject's measurement was taken) and Concentration.
A very simplified example would be:
df <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                 Concentration = c("XXX",0.3,0.7,0.6,"XXX","XXX",0.8,0.3,"XXX","XXX",
                                   "XXX",0.6,0.1,0.1,"XXX"),
                 Time = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5))
I would like to replace only the "XXX" values in column Concentration based on the following conditions:
when the value in column Time is less than or equal to 3, "XXX" should be replaced with 0;
when the value in column Time is greater than 3, "XXX" should be replaced with the word "Missing";
unless two consecutive "XXX" values appear for a single subject (ID) with Time > 3, in which case the first consecutive "XXX" should be replaced with 0.05 and the second consecutive "XXX" (or all following "XXX" values, if there are more) should be replaced with the word "Missing".
I have tried mutate_at and replace_na, some ifelse statements and case_when but I just cannot seem to figure out how to correctly do it. Any help would be greatly appreciated!
Edit: Just to show some work:
df[df == "XXX" & df$Time<3] <- as.numeric(0)
df[df == "BLQ" & df$Time>3] <- as.character("Missing")
I have managed to find a simple and robust solution that takes care of the first two parts of my problem. What I'm stuck on is the last part: when there are two or more consecutive "XXX" values for a single subject after Time > 3. I imagine I should loop an ifelse statement over an index list of the IDs or something like that, but I can't figure out how to do that.
It's very important that the IDs are somehow separated here, because there could be an "XXX" as the final Concentration of one ID and as the first Concentration of the next ID, and I do not want that to be read as two consecutive "XXX" values for a single ID.
I solved it using some functions from the tidyverse, and I also added some extra records to your example.
rm(list = ls(all = TRUE))
require(tidyverse)
df <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
                 Concentration = c("XXX",0.3,0.7,0.6,"XXX","XXX",0.8,0.3,"XXX","XXX",
                                   "XXX",0.6,0.1,0.1,"XXX",0.2,"XXX","XXX",1),
                 Time = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,6,7,8,9))
df <- as_tibble(df) %>%
  mutate(Concentration = as.character(Concentration),
         Concentration_Original = Concentration) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Time <= 3, "0", Concentration)) %>%
  group_by(ID) %>%
  mutate(Concentration = ifelse(Concentration == 'XXX' & Concentration == lead(Concentration),
                                "0.05",
                                ifelse(Concentration == 'XXX', "Missing", Concentration))) %>%
  replace_na(list(Concentration = "Missing")) %>%
  ungroup()
I just had to figure this out a few minutes ago and found this question while looking for a better version. Here's mine:
value_swap <- function(dataset, specified_columns, original_val, new_val) {
  temp. <- dataset
  temp.[, specified_columns][temp.[, specified_columns] == original_val] <- new_val
  return(temp.)
}
value_swap(mtcars, c("cyl","gear"), 4, 3.99)
You'll notice the 4s in the cyl and gear columns of mtcars are now 3.99, but the carb column is left alone.
As for your conditionals: you can just subset your dataset into 3 different ones based on your conditions, run the custom value_swap on each, then rbind them back together (sketched below). Much simpler than doing a giant nested one, in my opinion.
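As a minimal sketch of that idea against the question's df, covering the first two conditions only (the consecutive-"XXX" rule still needs grouped logic like the tidyverse answer above); it assumes Concentration is a character column, and df2 is just an illustrative name:
early <- value_swap(df[df$Time <= 3, ], "Concentration", "XXX", "0")
late  <- value_swap(df[df$Time > 3, ], "Concentration", "XXX", "Missing")
df2 <- rbind(early, late)
df2 <- df2[order(df2$ID, df2$Time), ]  # restore the original row order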

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market-share variable in a car market with 286 distinct models sold and a total of 501 cars sold. This group share is based on two car characteristics, cat = "compact", "midsize", "large" and yr = 77, 78, 79, 80, 81, plus the share s, a small double variable; a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio.com, "Calculating rowwise totals and proportions using tidyeval?" (link to the post on community.rstudio.com).
Applying the select-split-combine principle, the closest I've come to the correct answer is this 15-group summary (15 rows x 3 columns: cat, yr, group_share):
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s))
# in my actual data, this is what fills the group share to get what I want,
# but this isn't the desired pipeline-based answer
blp$group_share <- 0  # initializing the group_share, the 50th col
for (i in 1:501) {
  for (j in 1:15) {
    if ((blp[i, 31] == df[j, 1]) && (blp[i, 3] == df[j, 2])) {  # if (sameCat & sameYr) { blpGS <- dfGS }
      blp[i, 50] <- df[j, 3]
    }
  }
}
This is great, but I know this can be done in one fell swoop... Hopefully the idea is clear from what I've described above. A simple fix may be a loop that sets values by conditions on cat and yr, and that'd help, but I really am trying to get better at data wrangling with dplyr, so any insight toward the pipelined answer would be wonderful.
Example for the site: the example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.
#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))
names(blp)<-c("cat","yr","s")
head(blp)
# note: one example of a group share would be summing the share over this subset
(group_share.blp.large.81.s = blp[cat == "large" & yr == 81, ])
# works thanks to akrun: applying the code I provided leads to the 15 groups
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(as.numeric(as.character(s))))
# manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share <- 0
for (i in 1:45) {
  for (j in 1:15) {
    if ((blp[i, 1] == df[j, 1]) && (as.numeric(blp[i, 2]) == as.numeric(df[j, 2]))) {  # if (sameCat & sameYr) { blpGS <- dfGS }
      blp[i, 4] <- df[j, 3]
    }
  }
}
If I understood your problem correctly, this should help!
The only difference here is that instead of using summarise, which automatically returns only the grouping columns and the summarised one, you can use mutate to keep the original columns and add the aggregate to them.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
# Calculation
library(dplyr)
blp <-
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>%  # to create the dataframe
  group_by(cat, yr) %>%                                 # grouping by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>%        # sum of share per category/year
  ungroup()
Expected output
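If you also want the 15-row group table back from the mutated frame, one option is to deduplicate it (distinct() is a standard dplyr verb; group_table is an illustrative name):
# collapse back to one row per (cat, yr) group
group_table <- blp %>%
  distinct(cat, yr, group_share)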

Variance of a complete group of a dataframe in R

Let's say I have a dataframe with 10+1 columns and 10 rows, where every value has the same units except for one column (the "grouping" column A).
I'm trying to accomplish the following: given a grouping of the data frame based on that last column, how do I compute the standard deviation of each whole block as a single, monolithic variable?
Let's say I do the grouping (in reality it's a cut in intervals):
df %>% group_by(A)
From what I have gathered throughout this site, you can use aggregate or various dplyr methods to calculate the variance per column, as in this example (SO won't let me embed the image since I have <10 rep).
In that picture the grouping is shown as colors, but by using aggregate I would get one standard deviation per specified column (I know you can use cbind to get more than one variable, for example aggregate(cbind(V1,V2)~A, df, sd)) and per group (and similar methods using dplyr and %>%, with summarise(..., FUN=sd) appended at the end).
However what I want is this: just like in Matlab when you do
group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
stdev(group1(:)) % operator (:) is key here
% iterate for every group
I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.
Minimum working example:
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...
Result should be 3 doubles, each with the sd of the group (which should be close to 1 if enough columns are added).
If you want a base R solution, try the following.
sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633
I have coded it in two lines to make it clearer but you can avoid the creation of sp and write the one-liner
lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))
Or, for a result in another form,
sapply(sp, function(x) var(unlist(x)))
#(0.998,2] (2,3]
# 0.848707 1.806330
DATA
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
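Note that with the DATA above, A is column 11, so df[-1] drops V1 rather than the grouping column. A minimal sketch that drops A by name, splits on the levels of A directly, and returns the three standard deviations the question asks for:
# one sd per level of A, pooling all 10 value columns in each group
sapply(split(df[names(df) != "A"], df$A), function(g) sd(unlist(g)))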

Removing rows of dataframe based on frequency of a variable

I'm working with a dataframe (in R) that contains observations of animals in the wild (recording time/date, location, and species identification). I want to remove rows containing a certain species if there are fewer than x observations of that species in the whole dataframe. As of now, I've managed to get it to work with the following code, but I know there must be a more elegant and efficient way to do it.
namelist <- names(table(ind.data$Species))
for (i in 1:length(namelist)) {
  if (table(ind.data$Species)[namelist[i]] <= 2) {
    while (namelist[i] %in% ind.data$Species) {
      j <- match(namelist[i], ind.data$Species)
      ind.data <- ind.data[-j, ]
    }
  }
}
The namelist vector contains all the species names in the data frame ind.data, and the if statement checks whether the frequency of the ith name on the list is at most x (2 in this example).
I'm fully aware that this is not a very clean way to do it; I just threw it together at the end of the day yesterday to see if it would work. Now I'm looking for a better way to do it, or at least for ways to refine it.
You can do this with the dplyr package:
library(dplyr)
new.ind.data <- ind.data %>%
  group_by(Species) %>%
  filter(n() > 2) %>%
  ungroup()
An alternative using built-in functions is ave():
group_sizes <- ave(seq_along(ind.data$Species), ind.data$Species, FUN = length)  # numeric count per species
new.ind.data <- ind.data[group_sizes > 2, ]
We can use data.table
library(data.table)
setDT(ind.data)[, .SD[.N >2], Species]
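A base R refinement of the original table() idea, as a sketch: compute the species counts once and keep rows by membership.
# species observed more than 2 times in the whole dataframe
keep <- names(which(table(ind.data$Species) > 2))
new.ind.data <- ind.data[ind.data$Species %in% keep, ]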

Replacing values from a column using a condition in R

I have a very basic R question but I am having a hard time trying to get the right answer. I have a data frame that looks like this:
species <- "ABC"
ind <- rep(1:4, each = 24)
hour <- rep(seq(0, 23, by = 1), 4)
depth <- runif(length(ind), 1, 50)
df <- data.frame(cbind(species, ind, hour, depth))
df$depth <- as.numeric(df$depth)
What I would like is to select AND replace the depth values in all rows where depth < 10 (for example) with zero, while keeping all the other information in those rows and the original dimensions of the data frame.
I have tried the following, but it does not work:
df[df$depth<10] <- 0
Any suggestions?
# reassign depth values under 10 to zero
df$depth[df$depth<10] <- 0
For columns that are factors, you can only assign values that are existing factor levels. If you wanted to assign a value that isn't currently a factor level, you would need to create the additional level first:
levels(df$species) <- c(levels(df$species), "unknown")
df$species[df$depth<10] <- "unknown"
I arrived here from a Google search. Since my other code is 'tidy', I'm leaving the 'tidy' way here for anyone else who may find it useful:
library(dplyr)
iris %>%
  mutate(Species = ifelse(as.character(Species) == "virginica", "newValue", as.character(Species)))
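Applied back to the question's df, a hedged equivalent in the same tidy style (case_when is a standard dplyr verb; the threshold of 10 comes from the question):
library(dplyr)
# set depth to 0 wherever it is below 10; all other columns are untouched
df <- df %>%
  mutate(depth = case_when(depth < 10 ~ 0,
                           TRUE ~ depth))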
