R: Drop rows according to various criteria

I need some help with how to start implementing a problem in R. I have a data frame with rows which are grouped by the variable 'id'. For each 'id' I want to keep only one row. However, I have a number of criteria which specify which rows to drop.
These are some of my criteria:
I want to keep one random row within each group 'id' which has 'text' != NA (there might be several such rows); and I also want to keep all columns of this row, this is also the case for all following criteria.
If all rows in a group have 'text' == NA, then I want to keep one random row which has the variable 'check' == T (there might be several such rows)
If all rows in a group have 'text' == NA and 'check' == F, then I want to keep the row which has the variable 'newtext' which meets the condition !(grepl("None",df$newtext))
I can also provide a dataset if this makes it more clear. However, my most important issue is that I do not know how to implement this logic of dropping rows according to an ordered number of criteria.
It would be nice, if anyone can tell me how to implement such a code.
Thank you!
This would be an example dataset:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
text=c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
check = c(T,F,T,T,T,F,F,F,F),
newtext =
c("as","as","as","das","das","None","qwe","qwe2","None"),
othervars = c(1,2,3,45,5,6,6,7,1))
As an output, I want to keep the following rows:
row 1 or 3
row 4 or 5
row 7 or 8
The column othervars should be kept as well as I need this information later on.
Hope this makes it a bit clearer.

Alright, I've got something. I'm using filter() from dplyr to subset with unknown NA, because I ran into problems using either subset() or common df[,] subsetting from base R.
Data:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
text=c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
check = c(T,F,T,T,T,F,F,F,F),
newtext =
c("as","as","as","das","das","None","qwe","qwe2","None"),
othervars = c(1,2,3,45,5,6,6,7,1))
Initiating new empty dataframe:
df2 <- df[0,]
Loop to sample one row per id:
library(dplyr)
for(i in unique(df$id)){
  temp <- filter(df, id == i)
  if(nrow(filter(temp, !is.na(text))) > 0){
    temp <- filter(temp, !is.na(text))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  }else if(nrow(filter(temp, check)) > 0){
    temp <- filter(temp, check)
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  }else{
    temp <- filter(temp, !(grepl("None",temp$newtext)))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  }
}
Output example:
> df2
id text check newtext othervars
2 1 asd TRUE as 1
1 2 <NA> TRUE das 45
3 3 <NA> FALSE qwe 6
Greetings.
Edit: Ignore the row numbers on the left, they are residuals from the different subsets within the loop.
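The same ordered criteria can also be sketched without dplyr and without indexing df2 by the id value. This is only a sketch: the helper name pick_one is made up for illustration, and it assumes every group has at least one row that satisfies one of the three criteria.

```r
# Example data from the question
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 text = c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
                 check = c(TRUE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE),
                 newtext = c("as","as","as","das","das","None","qwe","qwe2","None"),
                 othervars = c(1,2,3,45,5,6,6,7,1),
                 stringsAsFactors = FALSE)

# pick_one() is a hypothetical helper: it walks through the criteria in
# order and samples one row from the first non-empty candidate set.
pick_one <- function(g) {
  cand <- which(!is.na(g$text))
  if (length(cand) == 0) cand <- which(g$check)
  if (length(cand) == 0) cand <- which(!grepl("None", g$newtext))
  # sample.int() on the number of candidates avoids the sample(x) pitfall
  # when there is exactly one candidate index
  g[cand[sample.int(length(cand), 1)], , drop = FALSE]
}

set.seed(42)  # only to make the random choice reproducible
picked <- do.call(rbind, lapply(split(df, df$id), pick_one))
rownames(picked) <- NULL
picked
```

Because each group is handled by the same function, adding a fourth criterion is just one more `if (length(cand) == 0)` line.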

Related

R: insert rows at specific places in dataframe

I can't seem to find an example to help me solve a particular problem in R. I have a data frame that looks like this:
tmp = data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
In reality I have thousands of columns and rows with many different values for group. The rows in the data frame are ordered by group.
I'd like to insert a new row above the first occurrence of each group. I'd also like for these new rows to only contain a value (the same value) in the first column (although I can make do if columns 2:ncol(tmp) contain NAs). Using the example data frame above, the end result should look like this:
group value
GROUP
A -1.7596279
A -0.8273928
A -0.3515738
A -0.7547999
A 0.5700747
GROUP
B -1.9676482
B 0.3996858
GROUP
C 0.1047832
C 0.5903711
C -1.3687259
C 0.3688415
C 1.3674403
C 0.8880089
Is there a way to do this? I can come up with a list of rows containing the first instance of each group. I was originally thinking that I could use this information to define where new rows should be inserted, but not sure if this is the best way to go.
I tried to create a function that does what you want it to do:
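addEmptyRows <- function(D)
{
  # work on the argument, not on the global tmp
  output <- D
  i <- 1
  while (i < NROW(output)) {
    if(output$group[i] != output$group[i+1])
    {
      # insert a divider row between two different groups
      output <- rbind(output[1:i,],c("GROUP","NA"),output[(i+1):NROW(output),])
      i <- i+1
    }
    i <- i+1
  }
  # add the divider above the first group as well
  return(rbind(c("GROUP","NA"),output))
}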
If you apply this function to your dataframe:
addEmptyRows(tmp)
It gives you the desired dataframe. Does this help you?
You could use something like this:
tmp <- data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
divider <- data.frame(group = "GROUP", value = NA)
do.call(rbind, unlist(lapply(split(tmp, tmp$group),
function(x) list(divider, x)), recursive = F))
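The question mentions already having the row numbers of the first instance of each group. A sketch that uses exactly that information is below: the divider rows are appended at the end and then sorted into place with a fractional sort key. The names first, key and out are made up for illustration, and stringsAsFactors = FALSE is set so that "GROUP" is not an invalid factor level on older R versions.

```r
tmp <- data.frame(group = c(rep("A", 5), rep("B", 2), rep("C", 6)),
                  value = rnorm(13),
                  stringsAsFactors = FALSE)

# row number of the first occurrence of each group
first <- which(!duplicated(tmp$group))

# one divider row per group; value stays numeric because NA_real_ is used
divider <- data.frame(group = "GROUP", value = NA_real_,
                      stringsAsFactors = FALSE)
dividers <- divider[rep(1, length(first)), ]

# original rows keep keys 1..n, each divider gets a key just below
# the first row of its group, so ordering by key slots it in above
combined <- rbind(tmp, dividers)
key <- c(seq_len(nrow(tmp)), first - 0.5)
out <- combined[order(key), ]
rownames(out) <- NULL
out
```

This avoids growing the data frame inside a loop, which may matter with thousands of rows.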

Count number of entries in each column with result in dataframe

I have a dataframe with many columns. I want to count the number of times something is entered into each column.
#Example data
Gender<-c("","Male","Male","","Female","Female")
location<-c("UK","France","USA","","","")
dataset<-data.frame(Gender,location, stringsAsFactors = FALSE)
There are 4 entries in the gender column and 3 entries in the location column.
I want the results to be in a dataframe such as:
result<-data.frame(Results=c("Gender","location"), Totals=c(4,3))
Can anyone suggest an approach to do this?
You can use the names of dataset as one column for result and calculate the Totals by counting how often grep matches anything that is a character (as opposed to nothing in an empty cell):
result <- data.frame(
Results = names(dataset),
Totals = sapply(dataset, function(x) length(grep(".", x)))
)
rownames(result) <- NULL
Result:
result
Results Totals
1 Gender 4
2 location 3
A base R option using stack + colSums
setNames(
rev(stack(colSums(dataset != ""))),
c("Results", "Total")
)
gives
Results Total
1 Gender 4
2 location 3
This should work for you:
ngen <- sum(dataset$Gender != "") #sum number entries in column that are not empty
nloc <- sum(dataset$location != "") #same thing
Totals <- c(ngen,nloc)
result<-data.frame(Results=c("Gender","location"), Totals)
You can simplify some of the steps if you want, but that would be the detailed way.
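If the real data can contain NA as well as empty strings, the `!= ""` comparison returns NA for those cells and the sums become NA. A small sketch that treats both as "not entered" is below. The function name count_filled is made up for illustration, and the NA in the example data is added here on purpose; it is not in the original example.

```r
# count_filled() is a hypothetical helper: a cell counts as an entry
# only if it is neither NA nor the empty string
count_filled <- function(d) {
  filled <- !is.na(d) & d != ""
  data.frame(Results = names(d),
             Totals = unname(colSums(filled)),
             stringsAsFactors = FALSE)
}

# same shape as the question's data, with one empty cell replaced by NA
dataset <- data.frame(Gender = c("", "Male", "Male", NA, "Female", "Female"),
                      location = c("UK", "France", "USA", "", "", ""),
                      stringsAsFactors = FALSE)

result <- count_filled(dataset)
result
```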

Substituting or summing based on condition

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"'s are of the same "Group" the one happening later (Year) substitutes the previous one.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because of the "Val" at time 1972 replaced the "Val" at time 1970. similarly for Beta.
I thank you in advance for your help
In my head, this requires a for loop. First we split the dataframe by the id column into a list of two. Then we create two empty lists. In the og list, we will put the row where the first unique non NA group identifier occurs. For alpha this is the first row and for Beta this is the second row. We will use this to subtract from the cumulative sum when the value gets substituted.
mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We shall use a nested loop, the outer loop loops over each object (dataframe in this case) in the list and the inner loop loops over each value in the Group column.
We need to keep track of the row numbers, which we do with the r variable. We initially set it to 0 outside the for loop so we add 1. First we check if we are in the first row of the data frame, in which case the cumulative sum is simply equal to the value in the first row of the Val column. Then within the if test, we use another if test to check if the Group id is an NA. If it isn't then this is the first occurrence of the number that will indicate a substitution of the current value if this number appears again. So we save the number to the temporary variable temp. We also extract and save the row that contains the value to the og list.
After this, it goes to the next iteration. We check if the current Group value is NA. If it is, then we just add the value to the cumulative sum. If it isn't NA, we check whether it is equal to the value stored in temp. If it is, then this means we need to substitute. We extract the original value stored in the og list and save it as old. We then subtract the old value from the cumulative sum and add the current value. We also replace the original value in og with the current replacement value. This is because if the value needs to be replaced again, we will need to subtract the current value and not the original value.
If j is not NA but is not equal to temp, then this is a new instance of Group. So we save the row with the original value to the og list, and save the Group. The sum continues as normal as this is not an instance of replacing a value. Note that the variable x that is used to count the elements in the og list is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] will always be the most recently stored value for the current Group.
for (my_df in mylist) {
  x <- 1
  r <- 0
  vals <- list()  # reset the running sums for each id
  temp <- NA      # no Group seen yet for this id
  for (j in my_df$Group) {
    r <- r + 1
    if (r == 1) {
      vals[[1]] <- my_df$Val[1]
      if (!is.na(j)) {
        # index my_df, not df, so that rows come from the current id
        og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j
        x <- x + 1
      }
      next
    }
    if (is.na(j)) {
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
    } else if (isTRUE(j == temp)) {
      # same Group as the stored one: swap the old value for the new one
      old <- og[[x-1]]
      old <- old[,2]
      vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
      og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
    } else {
      # first occurrence of a new Group
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val')]
      temp <- j
      x <- x + 1
    }
  }
  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}
Lastly, we combine the two dataframes in the list by binding them on rows with bind_rows from the dplyr package. Then I check if the Final dataframe is identical to your desired output with identical() and it evaluates to TRUE
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
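A more compact sketch of the same substitution idea is below. It is a variant rather than the loop above: it remembers the latest Val for every Group seen so far within an id (not only the most recent Group), so interleaved groups would also be handled. The function name cum_replace is made up for illustration, and rows are assumed to be already ordered by Year within each id.

```r
df <- data.frame(id = c("Alpha","Alpha","Alpha","Alpha","Beta","Beta","Beta","Beta"),
                 Year = c(1970,1971,1972,1973,1974,1977,1978,1990),
                 Group = c(1,NA,1,NA,NA,2,2,NA),
                 Val = c(2,3,3,5,2,5,3,5))

# cum_replace() is a hypothetical helper: running total in which a later
# Val of the same Group replaces the earlier one instead of adding to it
cum_replace <- function(val, grp) {
  total <- 0
  last <- numeric(0)            # named vector: latest Val per Group
  out <- numeric(length(val))
  for (k in seq_along(val)) {
    if (!is.na(grp[k])) {
      key <- as.character(grp[k])
      # take the previously counted value of this Group back out
      if (key %in% names(last)) total <- total - last[[key]]
      last[key] <- val[k]
    }
    total <- total + val[k]
    out[k] <- total
  }
  out
}

# ave() applies the helper per id; the row indices are passed through
# so that both Val and Group are available inside the function
df$cumval <- ave(as.numeric(seq_len(nrow(df))), df$id,
                 FUN = function(i) cum_replace(df$Val[i], df$Group[i]))
df$cumval
```

On the example data this reproduces the cumval column requested in the question.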

Removing Columns in one data frame based on values of another - conditional looping

Hi, I am having issues looping through my data frame and removing columns based on the condition that Suppress = 1. The loop needs to go through every column of df1 and remove the column if df2 has Suppress = 1 for that same variable, which means matching the variable name in df1 against the corresponding row of df2.
So there are two data frames. df1 contains all the data and df2 contains the conditions based on the variables of df1.
df1 <- data.frame("ID" = c(1,2,3,4,5), "Age" = c(19,50,46,32,28))
df2 <- data.frame("Variable" = c("ID", "Age"), "Suppress" = c(1,0))
The main issue I am having is that the loop I currently have works for when I make a data frame such as df1 and df2, but not for when I import a csv file and use that data.
Could it be the format of the data frames or does the loop need to be adjusted to work for the csv imports? I suspect the latter.
Here is the loop I currently have:
for(i in names(df1)){
if(df2$Variable == names(df1[i]) & df2$Suppress == 1){
df1[i] <- NULL
}
}
Another version... essentially the same
for(i in names(df1)){
if(df2$Variable %in% names(df1[i]) & df2$Suppress == 1){
df1[i] <- NULL
}
}
I cannot post a csv here, but I recommend trying to run the above code with an imported csv file similar to df1 and df2.
Note: Both the df1 and df2 are being imported as a csv file.
Recap: Why does the current loop not work with imported csv data, and what are alternative ways of removing the columns based on df2's Suppress variable?
Thanks
I believe the logic in your posted code is not right, you should be comparing each value of df2$Variable to names(df1).
for(i in seq_len(nrow(df2))){
  v <- as.character(df2$Variable[i])
  if(v %in% names(df1) && df2$Suppress[i] == 1){
    # remove the column by name, not by the row number of df2
    df1[[v]] <- NULL
  }
}
df1
# Age
#1 19
#2 50
#3 46
#4 32
#5 28
A vectorized way, with no loops at all is the following.
inx <- names(df1) %in% df2$Variable[df2$Suppress == 1]
df1[!inx]
# Age
#1 19
#2 50
#3 46
#4 32
#5 28
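As for why it might fail only with imported CSV files: that cannot be verified without the files, so the following is a guess. Two common culprits are stray whitespace in the Variable column and Variable being read as a factor on older R versions. Separately, in R 4.2.0 and later an if() condition longer than one element is an error, which affects the loops in the question regardless of where the data comes from. The sketch below simulates a CSV import with read.csv(text = ...) and normalises the names before matching. The inline CSV content and the names drop and kept are made up for illustration.

```r
# simulated CSV imports; note the trailing space after "ID" in df2
df1 <- read.csv(text = "ID,Age\n1,19\n2,50\n3,46\n4,32\n5,28",
                stringsAsFactors = FALSE)
df2 <- read.csv(text = "Variable,Suppress\nID ,1\nAge,0",
                stringsAsFactors = FALSE)

# names to suppress, converted to character and stripped of whitespace
drop <- trimws(as.character(df2$Variable[df2$Suppress == 1]))

# keep every column whose name is not in the drop list
kept <- df1[, !(names(df1) %in% drop), drop = FALSE]
names(kept)
```

Comparing `names(df1)` with `df2$Variable` using print() or dput() right after the import should show whether the strings really are identical.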

How to extract rows of a data frame between two characters

I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values, the first columns are all string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns are either a header like a percentage change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract data for each category. So for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in my first column. Is there a way to do this in either base R or dplyr easily?
eg.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1,2,3,4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
return(df[start:(end-1),])
}
Then loop over the indices in
dividers = which(df$column1 %in% keywords == TRUE)
And save the function outputs however one would like.
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. Messy data so I still have the annoying case where I need to keep the offset rows.
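The same slicing can be sketched in base R with cumsum() and split(), which also names each piece after its keyword. The example data frame below is invented for illustration (it only loosely follows the layout described in the question), and rows before the first keyword are dropped on purpose.

```r
# made-up stand-in for the messy CSV
df <- data.frame(column1 = c("random", "credit", "AAA", "BBB", "FX", "EUR", "USD"),
                 v1 = c("-1%", NA, "1", "2", "-1%", "3", "4"),
                 stringsAsFactors = FALSE)
keywords <- c("credit", "FX")

# TRUE on every row that starts a new section
is_key <- df$column1 %in% keywords

# running count of keywords seen so far; 0 means "before the first keyword"
section <- cumsum(is_key)

# one data frame per section, named after the keyword that opens it
pieces <- split(df[section > 0, ], section[section > 0])
names(pieces) <- df$column1[is_key]
pieces$credit
```

Unlike the lapply() over consecutive dividers, this keeps the rows after the last keyword as their own piece.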
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
set.seed(1)
df <- data.frame(
x = sample(LETTERS[1:10]),
y = rnorm(10),
z = runif(10)
)
start <- c("C", "E", "F")
df2 <- df %>%
mutate(start = x %in% start,
group = cumsum(start))
split(df2, df2$group)
