Creating a function looping over each row in R - r

I want to write a function that creates a new column containing the row means of Columns 1-3, but only if more than 2 of the questions in Columns 1-3 were answered in that row; otherwise print 'N'.
Here is my dataframe:
test <- data.frame(Manager1 = c(1, 3, 3), Manager2 = c(3, 4, 1), Manager3 = c(NA , 4, 2), Team1 = c(3, 4, 1))
Desired output:
Manager1 Manager2 Manager3 Team1 mean_score
       1        3       NA     3          N
       3        4        4     4    3.66667
       3        1        2     1          2
My code is as follows, but it's not working:
#create function
mean_score <- function(x) {
  for (i in 1:nrow(test)) {
    if (sum(test[i, x] != "NA", na.rm = TRUE) > 2) {
      test$mean_score[i] <- rowMeans(test[i, x], na.rm = TRUE)
    } else {
      test$mean_score[i] <- print("N")
    }
  }
}
#compute function
mean_score(1:3)
What am I missing? Suggestions on better code are welcome too.

I think it is not ideal to put a character value together with numeric values, since it will convert the whole column to character. However, if that is what you want:
my_sum <- function(x, min = 2) {
  s <- mean(x, na.rm = TRUE)        # get the mean
  no_na <- sum(!is.na(x))           # count the number of non-NAs
  if (no_na > min) {s} else {"N"}   # return mean only if enough non-NAs
}
test$mean <- apply(test[, 1:3], 1, my_sum)
test
test
Manager1 Manager2 Manager3 Team1 mean
1 1 3 NA 3 N
2 3 4 4 4 3.66666666666667
3 3 1 2 1 2
str(test)
'data.frame': 3 obs. of 5 variables:
$ Manager1: num 1 3 3
$ Manager2: num 3 4 1
$ Manager3: num NA 4 2
$ Team1 : num 3 4 1
$ mean : chr "N" "3.66666666666667" "2"
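If you would rather keep the column numeric, a small variation of the same idea returns NA_real_ instead of "N" (a sketch; my_sum_num is just an illustrative name):
my_sum_num <- function(x, min = 2) {
  # same counting logic as my_sum, but NA_real_ keeps the result numeric
  if (sum(!is.na(x)) > min) mean(x, na.rm = TRUE) else NA_real_
}
test$mean_num <- apply(test[, 1:3], 1, my_sum_num)
str(test$mean_num)
# num [1:3] NA 3.67 2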

You can simply use rowMeans, which will return NA if a row holds any NA; with three columns that is equivalent to "only if more than 2 questions for Columns 1-3 per row were answered".
test$mean_score <- rowMeans(test[,1:3])
# Manager1 Manager2 Manager3 Team1 mean_score
#1 1 3 NA 3 NA
#2 3 4 4 4 3.666667
#3 3 1 2 1 2.000000
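If the threshold were ever looser (say, at least 2 of the 3 questions answered), the counting can be made explicit with rowSums(); a minimal sketch:
# count answered questions per row, then take the mean only where enough were answered
n_answered <- rowSums(!is.na(test[, 1:3]))
test$mean_score <- ifelse(n_answered >= 2,
                          rowMeans(test[, 1:3], na.rm = TRUE),
                          NA)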

While GKi's answer is simpler and the one you should use, here is your code changed so that it works.
Generally, when making a function you want the input to be the data frame, in this case test, and to build the function from there.
Another important point: you probably want to build a vector of values first and then attach that vector to the data frame, as I do in the code below, but you need to create an empty vector object to do so. R doesn't really let you grow a data frame cell by cell; it prefers that a vector of equal length (which can be grown) be joined to it.
Also, you don't need print() to insert a character into a vector.
Hope this helps explain why your function was having issues, but frankly GKi's answer is better for general R use!
mean_score <- function(x) {
  mean_score <- vector()
  for (i in 1:nrow(x)) {
    # count the non-NA answers in the row (note: across all columns of x)
    if (sum(!is.na(x[i, ])) > 3) {
      mean_score[i] <- rowMeans(x[i, ], na.rm = TRUE)
    } else {
      mean_score[i] <- "N"
    }
  }
  x$mean_score <- mean_score
  return(x)
}
mean_score(test)

Is there an R function to make x rows equal to a specific row and repeat the operation?

Hi everyone!
Being a beginner with the R software (I think my request is feasible in this software), I would like to ask you a question.
In a large Excel-type file, I have a column where the values I am interested in appear only every 193 lines. So I would like the previous 192 rows to be set equal to the value in the 193rd position ... and so on for each block of 193 rows, until the end of the column.
Concretely, here is what I would like to get for this little example:
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 3 2 NA
2 3 NA NA
3 3 NA NA
4 3 NA NA
1 8 4 5
2 8 5 5
3 8 5 5
4 8 5 5
The column "Cluster_ref_INPUT" is partitioned according to the column "Fund_number" (one observation for each fund every month for 4 months). The values that interest me in the INPUT column appear every 4 observations (the value in the 4th month).
Thus, we can see that for each fund number, we find in the column "Expected_output" the values corresponding to the value found in the last line of the column "Cluster_ref_INPUT". (every 4 lines). I think we should partition by "Fund_number" and put that all the lines are equal to the last one... something like that?
Do you have any idea what code I should use to make this work?
I hope that's clear enough. Do not hesitate to ask if anything needs clarifying.
Thank you very much in advance,
Vanie
Here's a one-line solution using data.table:
library(data.table)
exdata <- fread(text = "
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 2 2 NA
2 2 NA NA
3 2 NA NA
4 2 NA NA
1 3 4 5
2 3 5 5
3 3 5 5
4 3 5 5")
# You can read your data directly as a data.table using fread, or convert an existing data.frame using setDT(exdata)
# .N is the number of rows per group, so Cluster_ref_INPUT[.N] picks the last value within each Fund_number
exdata[, newvar := Cluster_ref_INPUT[.N], by = Fund_number]
> exdata
Month Fund_number Cluster_ref_INPUT Expected_output newvar
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 3 1 1
4: 4 1 1 1 1
5: 1 2 2 NA NA
6: 2 2 NA NA NA
7: 3 2 NA NA NA
8: 4 2 NA NA NA
9: 1 3 4 5 5
10: 2 3 5 5 5
11: 3 3 5 5 5
12: 4 3 5 5 5
There are probably solutions using tidyverse that'll be a lot faster, but here's a solution in base R.
#Your data
df <- data.frame(Month = rep_len(c(1:4), 12),
                 Fund_number = rep(c(1:3), each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5),
                 stringsAsFactors = FALSE)
#Create an empty data frame in which the results will be stored
outdat <- data.frame(Month = c(), Fund_number = c(), Cluster_ref_INPUT = c(),
                     expected_input = c(), stringsAsFactors = FALSE)
#Using a for loop
#Iterate through the list of unique Fund_number values
for(i in 1:length(unique(df$Fund_number))){
  #Subset data pertaining to each unique Fund_number
  curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
  #Take the value of Cluster_ref_INPUT from the last row
  #And set it as the value for the expected_input column for all rows
  curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
  #Append this modified subset to the output container data frame
  outdat <- rbind(outdat, curdat)
  #Go to next iteration
}
#Remove non-essential looping variables
rm(curdat, i)
outdat
# Month Fund_number Cluster_ref_INPUT expected_input
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 3 1
# 4 4 1 1 1
# 5 1 2 2 NA
# 6 2 2 NA NA
# 7 3 2 NA NA
# 8 4 2 NA NA
# 9 1 3 4 5
# 10 2 3 5 5
# 11 3 3 5 5
# 12 4 3 5 5
EDIT: additional solutions + benchmarking
Per OP's comment on this answer, I've presented some faster solutions (dplyr and the data.table solution from the other answer) and also benchmarked them on a 950,004 row simulated dataset similar to the one in OP's example. Code and results below; the entire code-block can be copy-pasted and run directly as long as the necessary libraries (microbenchmark, dplyr, data.table) and their dependencies are installed. (If someone knows a solution based on apply() they're welcome to add it here.)
rm(list = ls())
#Library for benchmarking
library(microbenchmark)
#Dplyr
library(dplyr)
#Data.table
library(data.table)
#Your data
df <- data.frame(Month = rep_len(c(1:12), 79167),
                 Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
                 Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
                 stringsAsFactors = FALSE)
#Data in format for data.table
df_t <- data.table(Month = rep_len(c(1:12), 79167),
                   Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
                   Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
                   stringsAsFactors = FALSE)
#----------------
#Base R solution
#Using a for loop
#Iterate through the list of unique Fund_number values
base_r_func <- function(df) {
  #Create an empty data frame in which the results will be stored
  outdat <- data.frame(Month = c(),
                       Fund_number = c(),
                       Cluster_ref_INPUT = c(),
                       expected_input = c(),
                       stringsAsFactors = FALSE)
  for(i in 1:length(unique(df$Fund_number))){
    #Subset data pertaining to each unique Fund_number
    curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
    #Take the value of Cluster_ref_INPUT from the last row
    #And set it as the value for the expected_input column for all rows
    curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
    #Append this modified subset to the output container data frame
    outdat <- rbind(outdat, curdat)
  }
  #Remove non-essential looping variables
  rm(curdat, i)
  #This return is needed because the code is wrapped in
  #the base_r_func function (not necessary otherwise)
  return(outdat)
}
#----------------
#Tidyverse solution
dplyr_func <- function(df){
  df %>% #For actual use, replace this %>% with %<>%
    #and it will write the output back to the input object
    #Group the data by Fund_number
    group_by(Fund_number) %>%
    #Create a new column populated w/ the last value from Cluster_ref_INPUT
    mutate(expected_input = last(Cluster_ref_INPUT))
}
#----------------
#Data table solution
dt_func <- function(df_t){
  #For this function, we are using df_t (created above)
  #Logic similar to the dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
dt_func_conv <- function(df){
  #Converting the data.frame to the data.table format
  df_t <- data.table(df)
  #Logic similar to the dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
#----------------
#Benchmarks
bm_vals <- microbenchmark(base_r_func(df),
                          dplyr_func(df),
                          dt_func(df_t),
                          dt_func_conv(df), times = 8)
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 618.58202 702.30019 721.90643 743.02018 754.87397 756.28077 8
# dplyr_func(df) 119.18264 123.26038 128.04438 125.64418 133.37712 140.60905 8
# dt_func(df_t) 38.06384 38.27545 40.94850 38.88269 43.58225 48.04335 8
# dt_func_conv(df) 48.87009 51.13212 69.62772 54.36058 57.68829 181.78970 8
#----------------
As can be seen, data.table is the way to go if speed is a necessity. It is faster than dplyr and base R even when the overhead of converting a regular data.frame to a data.table is taken into account (see the results of dt_func_conv()).
Edit: following up on Carlos Eduardo Lagosta's comments, using setDT() to coerce df from a data.frame to a data.table makes the overhead of that coercion close to nil. Code snippet and benchmark values below.
#This version includes the time taken
#to coerce a data.frame to a data.table
dt_func_conv <- function(df){
  #Logic similar to the dplyr solution
  #setDT() coerces a data.frame to the data.table format (by reference)
  setDT(df)[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 271.60196 344.47280 353.76204 348.53663 368.65696 435.16163 8
# dplyr_func(df) 121.31239 122.67096 138.54481 128.78134 138.72509 206.69133 8
# dt_func(df_t) 38.21601 38.57787 40.79427 39.53428 43.14732 45.61921 8
# dt_func_conv(df) 41.11210 43.28519 46.72589 46.74063 50.16052 52.32235 8
For the OP specifically: whichever solution you choose, the code you're looking for is within the body of the corresponding function. So, for instance, if you want to use the dplyr solution, take this code and tailor it to your data objects:
df %>% #For actual use, replace this %>% with %<>%
  #and it will write the output back to the input object
  #Group the data by Fund_number
  group_by(Fund_number) %>%
  #Create a new column populated w/ the last value from Cluster_ref_INPUT
  mutate(expected_input = last(Cluster_ref_INPUT))
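For completeness, the same grouped last-value fill can be written in one line of base R with ave(); a minimal sketch, assuming the same column names as above:
# ave() applies a function within groups and recycles the result back over each group
df$expected_input <- ave(df$Cluster_ref_INPUT, df$Fund_number,
                         FUN = function(x) x[length(x)])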

Calculating moving differences across columns per row in r

I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between the two numbers in columns 1 and 2, then columns 3 and 4, and so on. I have looked at loops and the rollapply function, but could not figure this out. Below are three of the approaches I attempted. Only the third option gives me the result I am after, but it is very lengthy code and does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advise how to make this code shorter and/or suggest other packages/functions to check out which will do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but it is the long way around, so I would like to simplify the code and make it generic for any number of observations (i.e. rows) and any number of columns (i.e. more than 3). I would like to "automate" the steps below.
row1=diff(as.numeric(test[1,], lag=1))
row2=diff(as.numeric(test[2,], lag=1))
row3=diff(as.numeric(test[3,], lag=1))
row4=diff(as.numeric(test[4,], lag=1))
row5=diff(as.numeric(test[5,], lag=1))
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
For any of them to set the names:
colnames(d) <- paste0("diff.", head(names(test.data), -1), colnames(d))
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
An apply-based solution using diff row-wise can be achieved as:
# Result
res <- t(apply(test.data, 1, diff)) # one can convert it to a data.frame
# Name the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
                        tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40
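If, as the wording "columns 1 and 2, then columns 3 and 4" could also suggest, you wanted differences over non-overlapping pairs of columns rather than consecutive ones, a minimal sketch would be (assuming an even number of columns; with an odd count the last column is silently ignored):
m <- as.matrix(test.data)
even <- seq(2, ncol(m), by = 2)     # columns 2, 4, ...
odd  <- seq(1, ncol(m) - 1, by = 2) # columns 1, 3, ...
pair.diff <- m[, even, drop = FALSE] - m[, odd, drop = FALSE]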

Complete.cases used on list of data frames

I'm trying to remove all rows containing NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply? I had been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error: "Error in `[.default`(xj, i) : invalid subscript type 'list'"
What I'd like to know is some alternatives, and whether I was going about this the right way. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from; lapply does not create magic variables like that for you. Also, at this point comp is itself a list of logical vectors, and you are making no attempt to iterate over it in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
Welcome to SO; please provide some working code next time.
Here is how I would do it with na.omit (since complete.cases only returns a logical vector):
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
               dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical vector, which is then used to index the rows. This should do the job and return only the rows with no NA values.

For Loop in R - deleting all rows which match by one variable

I'm trying to completely delete rows in a dataset for cases with matching variables (case ID) with the help of this function I wrote:
del_row_func <- function(x){
  for(i in 1:length(x$FALL_ID)){
    for(j in 1:length(x$FALL_ID)){
      if(x$FALL_ID[i] == x$FALL_ID[j] & i != j){
        x[-i, ]
      }
    }
  }
}
Anybody have an idea why it doesn't work?
The reason your code didn't work was that you weren't modifying or returning x. However, there is a better way to remove all rows with a duplicated ID:
dat = data.frame(FALL_ID = c(1, 2, 2, 3), y = 1:4)
dat
# FALL_ID y
# 1 1 1
# 2 2 2
# 3 2 3
# 4 3 4
dat[!duplicated(dat$FALL_ID) & !duplicated(dat$FALL_ID, fromLast=T),]
# FALL_ID y
# 1 1 1
# 4 3 4
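Another base R way of expressing the same filter, as a sketch, uses ave() to count how often each ID occurs and keeps the rows whose ID occurs exactly once:
dat[ave(dat$FALL_ID, dat$FALL_ID, FUN = length) == 1, ]
#   FALL_ID y
# 1       1 1
# 4       3 4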

How do I take subsets of a data frame according to a grouping in R?

I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
            "c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
           "banana","durian","lemon","lime",
           "raspberry","durian","peach","nectarine",
           "banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
show that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
  cbind(x[sample(1:nrow(x),1),], data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
            .(group1,group2),
            choice,
            label="test")
Note that in this case, I am also adding an extra column to the data frame (called "state" in the code), populated via an extra argument to the ddply function. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function receives, what it should return, or what to do with the result (especially with by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient because I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
      value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
EDIT 2: Performance comparison: data.table is ~15x faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[ , list(value = value[sample(length(value), 1)]), 'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x){x[sample(nrow(x),1),]}))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
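On current tooling, the same grouped sampling can also be written with dplyr; a minimal sketch, assuming dplyr >= 1.0 for slice_sample():
library(dplyr)
df %>%
  group_by(group1, group2) %>%
  slice_sample(n = 1) %>% # randomly keep one row per group
  ungroup()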
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample withing groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will turn the matrix into a vector
  tapply(idx, list(group1, group2),
         function(x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
