I have a data.frame set up like this:
df <- data.frame(units = c(1.5, -1, 1.4),
                 what = c('Num1', 'Num2', 'Num3'))
Which gives me something like this:
      units what
1  1.500000 Num1
2 -1.000000 Num2
3  1.400000 Num3
I want to be able to remove the entire row if the number in the first column is -1. So, ideally, loop through the whole dataframe and remove the rows that have -1 in the units column. I've been trying things like this:
if (CONDITION TO REMOVE) {
  print("deleting function...")
  df <- df[-c(df[,'Num2']),]
}
But it deletes everything else in the df. I only want to delete that one entire row.
Thanks in advance.
newdf <- df[-which(df[, 1] == -1), ]
newdf is df without the rows containing -1 in the first column.
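One caveat with the -which() approach: if no row matches, which() returns integer(0), and df[-integer(0), ] drops every row rather than none. A small guard avoids that edge case:
idx <- which(df[, 1] == -1)
# only drop rows when there is actually something to drop
newdf <- if (length(idx) > 0) df[-idx, ] else df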
You can use dplyr, which may better suit your needs:
library(dplyr)
df.new <- df %>% filter(units != -1)
Or you can do this using base R:
df.new <- df[df$units != -1, ]
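One thing to keep in mind with both versions: if units can contain NA, dplyr's filter() silently drops those rows, while the base R comparison returns NA and yields all-NA rows. A variant that handles NAs explicitly:
# Handle possible NAs in units explicitly (a comparison with NA
# returns NA, which would otherwise produce all-NA rows in base R):
df.new <- df[!is.na(df$units) & df$units != -1, ]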
My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know I am on the wrong path with these lines, as I am not comparing the individual values:
lapply(list, function(x){
  if (x$"2" > (mean(x$"2")) + (2*sd(x$"2")) || x$"2" < (mean(x$"2")) - (2*sd(x$"2"))) {}
})
I was also hoping to replace every value identified as an outlier with the corresponding mean calculated from the 60 values in the same position, while keeping everything else as it is, but I am also quite unsure how to do that.
Thank you!
You haven't added an example of your data, so I've made a quick and simple example to demonstrate my answer. I think the logic is much more straightforward if you first combine the list of tibbles into a single tibble. This lets you do everything you want in a single dplyr pipe, ultimately flagging outliers with a 1 in the 'outlier' column:
library(tidyverse)

tibble1 <- tibble(colA = c(seq(1, 20, 1), 150),
                  colB = seq(0.1, 2.1, 0.1),
                  id = 1:21)

tibble2 <- tibble(colA = c(seq(101, 120, 1), -150),
                  colB = seq(21, 41, 1),
                  id = 1:21)

# N.B. if you don't have an 'id' column or equivalent,
# it makes things a lot easier if you add one;
# the 'id' column is essentially shorthand for an index

tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')

res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
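To also cover the replacement part of the question, here is a hedged sketch: assuming the per-id mean computed above is the value you want to substitute, one more mutate() swaps out the flagged values. Note that meanA here still includes the outlier itself; recomputing the mean without the flagged value would be a possible refinement.
# Sketch (assumption): replace flagged outliers in colA with the
# per-id mean; everything else is left untouched
res_imputed <- res %>%
  mutate(colA = ifelse(outlier == 1, meanA, colA)) %>%
  ungroup()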
I can't seem to find an example to help me solve a particular problem in R. I have a data frame that looks like this:
tmp = data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
In reality I have thousands of columns and rows with many different values for group. The rows in the data frame are ordered by group.
I'd like to insert a new row above the first occurrence of each group. I'd also like for these new rows to only contain a value (the same value) in the first column (although I can make do if columns 2:ncol(tmp) contain NAs). Using the example data frame above, the end result should look like this:
group value
GROUP
A -1.7596279
A -0.8273928
A -0.3515738
A -0.7547999
A 0.5700747
GROUP
B -1.9676482
B 0.3996858
GROUP
C 0.1047832
C 0.5903711
C -1.3687259
C 0.3688415
C 1.3674403
C 0.8880089
Is there a way to do this? I can come up with a list of rows containing the first instance of each group. I was originally thinking that I could use this information to define where new rows should be inserted, but not sure if this is the best way to go.
I tried to create a function that does what you want it to do:
addEmptyRows <- function(D)
{
  output <- D
  i <- 1
  while (i < NROW(output)) {
    if (output$group[i] != output$group[i+1])
    {
      output <- rbind(output[1:i,], c("GROUP", NA), output[(i+1):NROW(output),])
      i <- i + 1
    }
    i <- i + 1
  }
  return(rbind(c("GROUP", NA), output))
}
If you apply this function to your dataframe:
addEmptyRows(tmp)
It gives you the desired dataframe. Does this help you?
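One side effect worth flagging: rbind()-ing a character vector onto the data frame coerces the numeric value column to character. If you need the numbers downstream, a conversion step restores them (the divider rows simply become NA):
out <- addEmptyRows(tmp)
# value came back as character after rbind(); convert back to numeric,
# leaving NA in the divider rows
out$value <- as.numeric(out$value)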
You could use something like this:
tmp <- data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
divider <- data.frame(group = "GROUP", value = NA)
do.call(rbind, unlist(lapply(split(tmp, tmp$group),
                             function(x) list(divider, x)),
                      recursive = FALSE))
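One caveat with this version: split() orders the chunks by the factor levels of group, so if the groups do not already appear in level order the blocks get rearranged. Pinning the levels to the order of appearance first avoids that (a small sketch, assuming the original row order should be preserved):
# Assumption: blocks should keep their original order of appearance
tmp$group <- factor(tmp$group, levels = unique(tmp$group))
do.call(rbind, unlist(lapply(split(tmp, tmp$group),
                             function(x) list(divider, x)),
                      recursive = FALSE))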
I have a dataframe which includes a column of identifier codes. Where the code ends in a 0, I want to replace it with a 1.
Through a lot of trial and error I have a for loop which almost works. It works when there is only one code which ends in a 0 and it's in the last row of the dataframe. If there's another row of data, the for loop doesn't produce the desired output.
library(stringr)
df_a <- data.frame(a = c("02.1.1", "02.1.1.0"))
df_b <- data.frame(a = c("02.1.1", "02.1.1.0", "02.1.2"))
for (i in nrow(df_a)){
  df_a$adj <- ""
  df_a$code_adj <- ""
  if (str_sub(df_a[i, "a"], -1, -1) == "0"){
    df_a[i, "adj"] <- "1"
    df_a[i, "code_adj"] <- paste0(str_sub(df_a[i, "a"], 1, -2), df_a[i, "adj"])
  }
}
When I run the for loop on the dataframe df_a, it produces the desired result. When I run it on df_b it does not.
I'm open to better ways of approaching this problem, but I would also like to know why the for loop behaves as it does on the different dataframes.
We can create a function with sub and reuse it on multiple datasets: match the 0 at the end ($) of the string, replace it with 1 in the specified column, update the column, and return the dataset. As for why your loop behaves as it does: for (i in nrow(df_a)) iterates over the single value nrow(df_a), not over 1:nrow(df_a), so only the last row is ever checked; it appeared to work on df_a only because the row ending in 0 happened to be the last one. Using seq_len(nrow(df_a)) would fix the iteration (and the df_a$adj <- "" initialisations should then move outside the loop so they are not reset on every pass).
f1 <- function(dat, colNm) {
  dat[[colNm]] <- sub("0$", "1", dat[[colNm]])
  dat
}
f1(df_a, "a")
# a
#1 02.1.1
#2 02.1.1.1
f1(df_b, "a")
# a
#1 02.1.1
#2 02.1.1.1
#3 02.1.2
Could you not use the stringr package and do something like
df_b$a <- str_replace(df_b$a, "0$", "1")
This looks for a 0 at the end of the string and replaces it with a 1. Just note that you have to convert the column to character first, as str_replace does not work on factors:
df_b$a <- as.character(df_b$a)
I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values. The first column holds string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns contain either a header such as a percentage-change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract the data for each category. For instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in the first column. Is there a way to do this easily in either base R or dplyr?
e.g.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1, 2, 3, 4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1, 2, 3, 4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
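If the keywords can occur more than once, taking only the first match of each is one option (an assumption about which block you actually want):
# Assumption: the block of interest starts at the first "credit"
# and ends just before the first "FX"
start <- which(df$column1 == "credit")[1]
end <- which(df$column1 == "FX")[1]
df[start:(end - 1), ]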
Using what Brian suggested:
in_between <- function(df, start, end){
  return(df[start:(end-1), ])
}
Then loop over the indices in
dividers <- which(df$column1 %in% keywords)
And save the function outputs however one would like.
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. The data is messy, so I still have the annoying case where I need to keep the offset rows.
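If the rows after the last keyword should also be kept as a final block, a hedged extension of the same idea runs the last window up to the end of the data frame:
# Assumption: the rows following the last keyword form a block too
segments <- lapply(seq_along(dividers), function(x) {
  end <- if (x < length(dividers)) dividers[x + 1] else nrow(df) + 1
  in_between(df, start = dividers[x], end = end)
})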
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
library(dplyr)

set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)

start <- c("C", "E", "F")

df2 <- df %>%
  mutate(start = x %in% start,   # flag the rows where a new block begins
         group = cumsum(start))  # running count gives each block its own id

split(df2, df2$group)
I need some help with how to start implementing a problem in R. I have a data frame whose rows are grouped by the variable 'id'. For each 'id' I want to keep only one row. However, I have a number of criteria which specify which rows to drop.
These are some of my criteria:
I want to keep one random row within each group 'id' which has 'text' != NA (there might be several such rows); I also want to keep all columns of this row, which holds for all of the following criteria as well.
If all rows in a group have 'text' == NA, then I want to keep one random row which has the variable 'check' == T (there might be several such rows)
If all rows in a group have 'text' == NA and 'check' == F, then I want to keep the row which has the variable 'newtext' which meets the condition !(grepl("None",df$newtext))
I can also provide a dataset if this makes it more clear. However, my most important issue is that I do not know how to implement this logic of dropping rows according to an ordered number of criteria.
It would be nice if anyone could tell me how to implement this.
Thank you!
This would be an example dataset:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 text = c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
                 check = c(T,F,T,T,T,F,F,F,F),
                 newtext = c("as","as","as","das","das","None","qwe","qwe2","None"),
                 othervars = c(1,2,3,45,5,6,6,7,1))
As an output, I want to keep the following rows:
row 1 or 3
row 4 or 5
row 7 or 8
The column othervars should be kept as well as I need this information later on.
Hope this makes it a bit clearer.
Alright, I've got something. I'm using filter() from dplyr for the subsetting, because I ran into problems with NA values when using either subset() or plain df[,] subsetting from base R.
Data:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 text = c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
                 check = c(T,F,T,T,T,F,F,F,F),
                 newtext = c("as","as","as","das","das","None","qwe","qwe2","None"),
                 othervars = c(1,2,3,45,5,6,6,7,1))
Initialising a new empty dataframe:
df2 <- df[0,]
Loop to sample one row per id:
library(dplyr)

for(i in unique(df$id)){
  temp <- filter(df, id == i)
  if(nrow(filter(temp, !is.na(text))) > 0){
    # criterion 1: at least one row with non-NA text
    temp <- filter(temp, !is.na(text))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  } else if(nrow(filter(temp, check)) > 0){
    # criterion 2: all text is NA, but at least one row has check == TRUE
    temp <- filter(temp, check)
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  } else {
    # criterion 3: fall back to rows whose newtext does not contain "None"
    temp <- filter(temp, !grepl("None", newtext))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  }
}
Output example:
> df2
  id text check newtext othervars
2  1  asd  TRUE      as         1
1  2 <NA>  TRUE     das        45
3  3 <NA> FALSE     qwe         6
Greetings.
Edit: Ignore the row numbers on the left, they are residuals from the different subsets within the loop.
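For reference, the same ordered-criteria logic can also be written without an explicit loop. This is only a sketch, assuming dplyr >= 1.0 is available (for slice_sample()): it ranks each row by the criteria and then samples one row from the best available rank per id.
library(dplyr)

df3 <- df %>%
  group_by(id) %>%
  # Rank rows by the ordered criteria: non-NA text beats check == TRUE,
  # which beats newtext not containing "None"
  mutate(priority = case_when(
    !is.na(text)            ~ 1L,
    check                   ~ 2L,
    !grepl("None", newtext) ~ 3L,
    TRUE                    ~ 4L
  )) %>%
  filter(priority == min(priority)) %>%  # keep only the best rank per id
  slice_sample(n = 1) %>%                # pick one of those rows at random
  ungroup() %>%
  select(-priority)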