I have a large data frame with multiple columns calculated from other columns. The issue arises where there are values of 8888 and 9999, which stand for NA and "refused to answer" respectively. These values have been incorrectly used to calculate other columns (such as pricepergram) because they were not set to NA before the calculation.
I'm not able to recalculate all the values, so instead I would like to find some code that takes each row of the dataframe as an argument. If the maximum value in the row is above 8887, it should return the row but with all price values set to NA.
The solution needs to be applicable to a dataframe of 250 columns, and I need to be able to apply it across multiple columns rather than just one.
I have confirmed that the only values above 8887 in the dataframe are indeed either 9999 or 8888 and therefore constitute values that we want to change.
I am not able to post the dataset due to data protection (apologies), but have given an example of minimum complexity to illustrate my point.
This would be the ideal output: the rows with values above 8887 have had their prices set to NA.
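For instance, a stand-in of the same shape (made-up column names and values) might be:

# hypothetical stand-in; the real data has 250 columns
df <- data.frame(
  grams        = c(2, 9999, 5),
  price        = c(20, 30, 8888),
  pricepergram = c(10, 30/9999, 8888/5)
)

# desired result: rows containing any code > 8887 get their price
# columns set to NA (the codes themselves can stay as they are)
#   grams price pricepergram
# 1     2    20           10
# 2  9999    NA           NA
# 3     5    NA           NA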
We can break this problem into two steps:
find out if there are any 8888 or 9999 codes in a row
set values in the row to NA
Step 1: The following code produces an indicator for whether a row contains any codes greater than 8887:
any_large_codes = apply(df, MARGIN = 1, function(row){any(row > 8887)})
It works as follows: apply treats the dataframe as a matrix. MARGIN = 1 means that the function is applied to each row of the matrix. function(row){any(row > 8887)} checks whether any value in its input (each row) is larger than 8887. One caveat: if the dataframe contains non-numeric columns, apply coerces the whole matrix to character, so in that case restrict the check to the numeric columns (e.g. df[, numeric_columns], where numeric_columns is whatever set of columns you want to scan) before comparing.
I have not used dplyr for this as I am not aware of any row-wise operators in dplyr. This seems the best option. You can use dplyr to add it into the dataframe if you wish, but this is not necessary:
df = df %>% mutate(na_indicator = any_large_codes)
Step 2: The following code sets the values in a single column to NA where there are any large codes:
df = df %>%
  mutate(this_one_column = ifelse(any_large_codes, NA, this_one_column))
If you want to handle multiple columns, I would suggest something like this:
all_columns_to_handle = c(
  "col1",
  "col2",
  "col3",
  ...
)

for(cc in all_columns_to_handle){
  df = df %>%
    mutate(!!sym(cc) := ifelse(any_large_codes, NA, !!sym(cc)))
}
Where !!sym(cc) is a way to use the column name stored in cc and := is equivalent to = but allows us to use !!sym(cc) on the left-hand side. For other options to this approach see the programming with dplyr vignette.
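As a side note, if you have dplyr 1.0 or newer, across() can replace the loop entirely. A minimal sketch, assuming the same all_columns_to_handle vector and any_large_codes indicator as above:

library(dplyr)

df = df %>%
  mutate(across(
    all_of(all_columns_to_handle),
    ~ ifelse(any_large_codes, NA, .x)
  ))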
I'm trying to write a function to process multiple similar datasets. Here I want to subtract the score obtained by a subject in the previous interview from the score obtained by the same subject in the second interview. In all the datasets I want to process, the score of interest is stored in the second column. Writing this for each specific dataset is simple: just use the exact column name and everything goes fine.
d <- a %>%
  arrange(interview_date) %>%
  dplyr::group_by(subjectkey) %>%
  dplyr::mutate(score_change = colname_2nd - lag(colname_2nd))
But since I need a generic function that can process multiple datasets, I cannot use the exact column name. So I tried 3 approaches, each of which only altered the last line.
Approach#1:
dplyr::mutate(score_change = dplyr::vars(2)-lag(dplyr::vars(2)))
Approach#2:
The second column name of each dataset of interest contains the same string, so I tried:
dplyr::mutate(score_change = dplyr::vars(matches('string'))-lag(dplyr::vars(matches('string'))))
The error message for the two approaches above is:
Error in dplyr::vars(2) - lag(dplyr::vars(2)) :
non-numeric argument to binary operator
Approach#3:
dplyr::mutate(score_change = .[[2]]-lag(.[[2]]))
Error message:
Error: Column `score_change` must be length 2 (the group size) or one, not 10880
10880 is the number of rows in my sample dataset, so it looks like group_by does not take effect in this approach.
Does anyone know how to make the function perform in the desired way?
If you want to use the position of the columns, use cur_data()[[2]] to refer to the 2nd column of the dataframe.
library(dplyr)
d <- a %>%
arrange(interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = cur_data()[[2]]-lag(cur_data()[[2]]))
Also note that cur_data() doesn't count the grouping columns, so if subjectkey is the first column in your data and colname_2nd is the second one, you may need cur_data()[[1]] instead once you group_by.
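To illustrate with a small made-up dataset (hypothetical values), where subjectkey is column 1 and the score is column 2:

library(dplyr)

a <- data.frame(
  subjectkey = c("s1", "s1", "s2", "s2"),
  score = c(10, 15, 20, 18),
  interview_date = as.Date(c("2020-01-01", "2020-06-01",
                             "2020-01-01", "2020-06-01"))
)

a %>%
  arrange(interview_date) %>%
  group_by(subjectkey) %>%
  # cur_data() drops the grouping column subjectkey,
  # so the score column is at position 1 here, not 2
  mutate(score_change = cur_data()[[1]] - lag(cur_data()[[1]]))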
I am working with a data-frame in R. I have the following function, which removes all rows of a data-frame df where, for a specified column index/attribute, the value in that row is outside the column's mean plus or minus n*stdev.
remove_outliers <- function(df, attr, n){
  m <- mean(df[, attr], na.rm = TRUE)
  s <- sd(df[, attr], na.rm = TRUE)
  outliersgone <- df[df[, attr] <= m + n*s & df[, attr] >= m - n*s, ]
  return(outliersgone)
}
There are two parts to my question.
(1) My data-frame df also has a column 'Group', which specifies a class label. I would like to be able to remove outliers according to mean and standard deviation within their group within the column, i.e. organised by factor (within the column). So you would remove from the data-frame a row labelled with group A if, in the specified column/attribute, the value at that row is outside mean (of group A rows in that column) plus/minus n*stdev (of group A rows in that column). And the same for groups B, C, D, E, F, etc.
How can I do this? (Preferably using only base R and dplyr.) I have tried to use df %>% group_by(Group) followed by mutate but I'm not sure what to pass to mutate, given my function remove_outliers seems to require the whole data-frame to be passed into it (so it can return the whole data-frame with rows only removed based on the chosen attribute attr).
I am open to hearing suggestions for changing the function remove_outliers as well, as long as they also return the whole data-frame as explained. I'd prefer solutions that avoid loops if possible (unless inevitable and no more efficient method presents itself in base R / dplyr).
(2) Is there a straightforward way I could combine outlier considerations across multiple columns? E.g. remove from the dataframe df those rows which are outliers with respect to at least N attributes out of a specified vector of attributes/column indices (of length ≥ N), or a more complex condition like: remove those rows which are outliers with respect to Attribute 1 and at least 2 of Attributes 2, 4, 6, 8.
(Ideally the definition of outlier would again be within-group within column, as specified in question 1 above, but a solution working in terms of just within column without considering the groups would also be useful for me.)
Ok - part 1 (and trying to avoid loops wherever possible):
Here's some test data:
test_data = data.frame(
  group = c(rep("a", 100), rep("b", 100)),
  value = rnorm(200)
)
We'll find the groups:
groups=levels(test_data[,1]) # or unique(test_data[,1]) if it isn't a factor
And we'll calculate the outlier limits (here I'm specifying only 1 sd) - sorry for the loop, but it's only over the groups, not the data:
outlier_sds = 1
outlier_limits = sapply(groups, function(g) {
  m = mean(test_data[test_data[,1]==g, 2])
  s = sd(test_data[test_data[,1]==g, 2])
  return(c(m - outlier_sds*s, m + outlier_sds*s))
})
So we can define the limits for each row of test_data:
test_data_limits=outlier_limits[,test_data[,1]]
And use this to determine the outliers:
outliers=test_data[,2]<test_data_limits[1,] | test_data[,2]>test_data_limits[2,]
(or, combining those last steps):
outliers=test_data[,2]<outlier_limits[1,test_data[,1]] | test_data[,2]>outlier_limits[2,test_data[,1]]
Finally:
test_data_without_outliers=test_data[!outliers,]
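Since you mentioned dplyr, the same group-wise filtering can also be written with group_by() + filter(). A sketch using the same test data and limits:

library(dplyr)

test_data_without_outliers = test_data %>%
  group_by(group) %>%
  filter(value >= mean(value) - outlier_sds*sd(value),
         value <= mean(value) + outlier_sds*sd(value)) %>%
  ungroup()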
EDIT: now part 2 (apply part 1 with a loop over all the columns in the data):
Some test data with more than one column of values:
test_data2 = data.frame(
  group = c(rep("a", 100), rep("b", 100)),
  value1 = rnorm(200),
  value2 = 2*rnorm(200),
  value3 = 3*rnorm(200)
)
Combine all the steps of part 1 into a new function, find_outliers, which returns a logical vector indicating whether each value is an outlier for its respective column and group:
find_outliers = function(values, n_sds, groups) {
  group_names = levels(groups)
  outlier_limits = sapply(group_names, function(g) {
    m = mean(values[groups==g])
    s = sd(values[groups==g])
    return(c(m - n_sds*s, m + n_sds*s))
  })
  return(values < outlier_limits[1,groups] | values > outlier_limits[2,groups])
}
And then apply this function to each of the data columns:
test_groups=test_data2[,1]
test_data_outliers=apply(test_data2[,-1],2,function(d) find_outliers(values=d,n_sds=1,groups=test_groups))
The rowSums of test_data_outliers indicate how many times each row is considered an 'outlier' in the various columns, with respect to its own group:
rowSums(test_data_outliers)
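These counts answer part 2 directly: filter the rows on whatever condition you need. For example, with this test data:

# remove rows that are outliers in at least N of the value columns
N = 2
test_data2_keepers = test_data2[rowSums(test_data_outliers) < N, ]

# a more complex condition: drop rows that are outliers in value1
# AND in at least 2 of the remaining columns
drop = test_data_outliers[, "value1"] &
  rowSums(test_data_outliers[, c("value2", "value3")]) >= 2
test_data2_keepers2 = test_data2[!drop, ]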
First time asking a question. I tried searching in vain for an answer but can't seem to find exactly what I am looking for.
I have a small (2x2) data frame at the moment:
   status weighted.responses
1 control           3.872694
2 exposed           3.713198
What I want to be able to do is subtract 'Exposed' from 'Control' by calling out a specific name, as opposed to [2,2] - [1,2]. Reason being, there will be several more columns added to this data frame as time goes on.
I have tried to transpose the data frame, promote the rows to headers, and then remove the row being used for a header, but then I end up with a vector.
Transposing also seems to turn my data frame to strings for some reason, which is another problem.
I have tried just taking the vector of 'weighted.responses', naming its elements accordingly, and subtracting, but then the new variable ends up being a named number, which I don't want. At that point it just seemed like a waste of time and space to have several different lines for a problem I am sure has a simple solution.
I feel like I am running circles around a very simple solution, but I can't figure it out.
I am very appreciative of your time, and apologies for the formatting.
There are a couple of different ways you can do this using dplyr/tidyverse. Note that functions like spread tend to work better for reshaping dataframes than t(), which turns your dataframe into a matrix and coerces all values to the same type. Examples of things you can do:
library(tidyverse)
df = data.frame(
  status = c("Control", "Exposed"),
  response = c(3.87, 3.71)
)
df %>% spread(status, response) %>% summarize(diff = Control - Exposed)
# Output:
diff
1 0.16
df %>%
summarize(diff = response[status == "Control"] - response[status == "Exposed"])
# Output:
diff
1 0.16
1) subtraction This will subtract row 1 from row 2 even if there are more than 2 columns. It is assumed that the other columns are numeric and are to be differenced as well. Note that -1 here means all columns except the first.
DF[2, -1] - DF[1, -1]
2) rownames Another way to do it is to convert the first column to row names and then do the subtraction:
DF1 <- DF[-1]
rownames(DF1) <- DF[[1]]
DF1["exposed", ] - DF1["control", ]
3) lapply This would also work:
data.frame(lapply(DF[-1], diff))
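All three approaches assume DF is the data frame shown in the question, i.e.:

DF <- data.frame(
  status = c("control", "exposed"),
  weighted.responses = c(3.872694, 3.713198)
)

DF[2, -1] - DF[1, -1]
# [1] -0.159496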
I've posted a sample of the data I'm working with here.
"Parcel.." is the main indexing variable and there are good amount of duplicates. The duplicates are not consistent in all of the other columns. My goal is to aggregate the data set so that there is only one observation of each parcel.
I've used the following code to attempt summing numerical vectors:
aggregate(Ap.sample$X.11 ~ Ap.sample$Parcel.., FUN = sum)
The problem is it removes everything except the parcel and the other vector I reference.
My goal is to apply one rule (sum) to certain numerical vectors (X.11, X.13, X.15, num_units) across the observations of a parcel ID, a different rule (average) to other numerical vectors (Acres, Ttl_sq_ft, Mtr.Size), and still a different rule (just pick one value) for the character variables (pretend there's another column "customer.name" with different values for the same unique parcel ID, e.g. "Steven condominiums" and "Stephen apartments"), and then to just delete the extra observations for all the other variables.
I've tried to use the numcolwise function but that also doesn't do what I need.
My instinct would be to specify the columns I want to sum and the columns I want to take the average like so:
DT<-as.data.table(Ap.sample)
sum_cols<-Ap.05[,c(10,12,14)]
mean_cols<-Ap.05[,c(17:19)]
and then use the lapply function to go through each observation and do what I need.
df05<-DT[,lapply(.SD,sum), by=DT$Parcel..,.SDcols=sum_cols]
df05<-DT[,lapply(.SD,mean),by=DT$Parcel..,.SDcols=mean_cols]
but that spits out errors on the first go. I know there's a simpler workaround than trying to muscle through it.
You could do:
library(dplyr)
df %>%
  # create a hypothetical "customer.name" column
  mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
  # group data by "Parcel.."
  group_by(Parcel..) %>%
  # apply sum() to the selected columns
  mutate_each(funs(sum(.)), one_of("X.11", "X.13", "X.15", "num_units")) %>%
  # likewise for mean()
  mutate_each(funs(mean(.)), one_of("Acres", "Ttl_sq_ft", "Mtr.Size")) %>%
  # select only the desired columns
  select(X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
  # de-duplicate, keeping an arbitrary value (the first one in row order);
  # .keep_all = TRUE is needed to retain the other columns
  distinct(Parcel.., .keep_all = TRUE)
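Note that mutate_each() and funs() have since been deprecated; on dplyr 1.0+, a sketch of the same aggregation using summarize() and across() might look like this:

df %>%
  group_by(Parcel..) %>%
  summarize(
    across(c(X.11, X.13, X.15, num_units), sum),  # sum these columns
    across(c(Acres, Ttl_sq_ft, Mtr.Size), mean),  # average these
    customer.name = first(customer.name)          # pick one name per parcel
  )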
I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large dataframe with several columns (up to 40) and rows (up to 200ish). I want to use data from one of the columns to do simple stats (wilcox.test, boxplot, etc): one column will have a continuous variable (V1), while the other has a binary variable (V2; 0 or 1), which divides 2 groups. I want to do this for the continuous variable using different V2 binary variables, which are unrelated. I organized this data in Excel, saved it as CSV and am using R Studio.
All these columns have interspersed NA values, and when I use na.omit it just removes every single row where an NA value is present, which throws away an awful lot of data. Is there any simple solution to this? I have seen some answers to similar topics, but none seems to be quite what I need to do.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!
If I understand correctly, you want to apply the function to a pair of columns each time:
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
where each Vi has no missing values. I would do something like this:
## use complete.cases to assert that you have no missing values
## for the selected pair
apply_clean <- function(x, y){
  ok <- complete.cases(x, y)
  wilcox.test(x[ok], y[ok])
}

## apply this function to all columns other than the continuous
## column, pairing each with V1
lapply(subset(dat, select = -V1), apply_clean, y = dat$V1)
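As a usage note for the binary-grouping case in the question, the formula interface of wilcox.test() drops incomplete pairs on its own via its na.action argument, so each test only loses the rows with an NA in the two columns involved (V1 and V2 being the question's hypothetical names):

## rows with NA in V1 or V2 are dropped for this test only,
## not from the whole dataframe
wilcox.test(V1 ~ V2, data = dat)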
You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code uses is.na() to test whether the value in a specific column is NA for each row. The ! means "not", so the subsetting keeps only the rows where col1 is not missing.