output results from a subset of data - r

Take this simple data as an example:
A <- (1:100)
B <- (4:103)
C <- (100:199)
D <- (1000:1099)
df <- data.frame(A,B,C,D)
unique_set <- c('B','C')
Now it is simple to create a subset of df accounting for the unique_set variables, using
df[unique_set]
But let's say I want only a specific row, or more specifically the row for a specific A value. If we try this, an error occurs:
df[unique_set][df$A == 78]
Or
df[unique_set & df$A == 78]
So what I want it to output is what this returns:
df[unique_set][78,]
Because A happens to be in sequential order, the code above works. But how can the user set specific conditions (i.e. on the A value) while accounting for our unique_set requirement at the same time?
Must one include A with the unique_set command?

Basically, data.frame subsetting looks like:
df[condition on rows, condition on columns]
So in your case, you want to select all the rows where column A is 78 and at the same time select only the columns specified in unique_set:
df[df$A == 78, unique_set]
Try to play with these examples:
df[df$A == 78, c("B", "C")]
df[df$A == 78, c(2, 3)]
df[78, c(1:3)]
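For reference, with the sample data above, the first two calls are equivalent and should return the single row where A is 78:
df[df$A == 78, unique_set]
#    B   C
# 78 81 177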

Related

Accessing List Value by looking up value with same index

I was following this blog post:
https://www.robert-hickman.eu/post/dixon_coles_1/
And in a number of places, he gets a value from a list by putting in a value with the equivalent index, rather like using it as a key in a Python dictionary. This would be an example from the link:
What I understand he's done is basically this:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
example_list$values[a]
Error: object 'a' not found
But I get an NA value for this - am I missing something?
Thanks in advance!
The way a list works in R makes it impractical to address a list like that, because the elements of the list aren't associated with each other in that way.
Which leads to this:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
#Gives NULL
example_list[example_list$teams == "a"]$values
#Gives 1, 2, 3
example_list[example_list$teams == "b"]$values
#Gives NULL
example_list[example_list$teams == "c"]$values
You can see that this wouldn't work; the syntax you would expect to work in this case throws an "incorrect number of dimensions" error:
example_list[example_list$teams == "b", ]$values
However, it is really easy to address a data frame, or any matrix-like structure, in the way you want to:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
#Gives 1
example_df[example_df$teams == "a", ]$values
#Gives 2
example_df[example_df$teams == "b", ]$values
#Gives 3
example_df[example_df$teams == "c", ]$values
What I think is happening in the tutorial you shared is something else. As far as I can see, what gets passed into the brackets is not a name but a variable holding an index. It is not returning the value of some higher-dimensional structure, but rather an element of the list itself.
That also makes a lot more sense, because that is exactly what the syntax does: teams[1] simply returns the first element of the list called teams (even if that element is a vector or something else). Of course teams[i], where i is a variable holding an index, also works. What I mean is this:
teams = list(A = 1, B = 2, C = 3, D = 4)
#Gives $A, the first element (which holds 1)
teams[1]
If you want to understand why one of them works and the other doesn't, here are both together. Paste it into RStudio and look through the Environment pane.
## One dimensional
teams = list(A = "a", B = "very", C = "good", D = "example")
#Gives "very"
teams[2]
## Two dimensional
teams <- c("a","b","c")
values <- c(1,2,3)
teams2 <- list(teams, values)
#Gives "a, b, c"
teams2[1]
#Gives NULL
teams2[3]
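As an aside, if the goal really is a Python-dictionary-style lookup by name, a named vector gives you that directly. This is only a minimal sketch of that idea, not necessarily what the blog post does:
teams <- c("a", "b", "c")
# name each value after its team, so it can be looked up by name
values <- setNames(c(1, 2, 3), teams)
#Gives 2 (looked up by name, like a dictionary key)
values["b"]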

Preventing variable renaming while creating new dataframe from another data frame

I have a very simple dataset like this,
a <- c(29, 10, 29)
b <- c(32, 23, 43)
c <- c(33,22,1)
df1 <- data.frame(a, b, c)
I want to create a new data frame from vectors a and c of df1. I am running the following command:
df2 <- data.frame(df1$a, df1$c)
It creates a data frame with the variable names df1.a and df1.c. Is there any way I can have the variable names exactly like what I have in df1?
df2 <- data.frame(a=df1$a, c=df1$c)
   a  c
1 29 33
2 10 22
3 29  1
I assume your a, b, c variables are not directly available anymore
colnames(df2) <- c("a", "c")
should do the trick?
df1[,c("a","c")]
In case you select only column: df1[,"a",drop=FALSE].
Always include drop=FALSE to handle the general case:
selectedColumns <- c("a","c")
df1[, selectedColumns, drop=FALSE]
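A quick illustration of the difference drop=FALSE makes with the sample data:
df1[, "a"]                # returns a plain vector: 29 10 29
df1[, "a", drop = FALSE]  # returns a one-column data frame that keeps the name a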
If your real application is more complex than just taking a subset (which seems the obvious solution here), you can use setNames (in this case it doesn't add much, but it can help if you are trying to automatically rename the data frame at construction...):
df2 <- setNames(df1[, c('a', 'c')], names(df1[, c('a', 'c')]))

How to extract rows of a data frame between two characters

I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values: the first column is all string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns hold either a header such as a percentage-change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract the data for each category. So, for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in the first column. Is there a way to do this easily in either base R or dplyr?
e.g. something like:
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1,2,3,4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
  return(df[start:(end - 1), ])
}
Then loop over the indices in
dividers <- which(df$column1 %in% keywords)
And save the function outputs however one would like.
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. Messy data so I still have the annoying case where I need to keep the offset rows.
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)
start <- c("C", "E", "F")

library(dplyr)
df2 <- df %>%
  mutate(start = x %in% start,
         group = cumsum(start))
split(df2, df2$group)
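Applied to the layout in the question, the same cumsum idea would look roughly like this. It is only a sketch: raw_df is a hypothetical name for the asker's original data frame, assumed to carry the keyword rows such as "credit" and "FX" in column1:
library(dplyr)
keywords <- c("credit", "FX")
sections <- raw_df %>%              # raw_df: the original messy data (hypothetical name)
  mutate(group = cumsum(column1 %in% keywords)) %>%
  filter(group > 0)                 # drop anything before the first keyword
split(sections, sections$group)     # one data frame per keyword section, keyword row included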

R Drop rows according to various criteria

I need some help regarding how to start implementing a problem in R. I have a data frame with rows which are grouped by the variable 'id'. For each 'id' I want to keep only one row. However, I have a number of criteria which specify which rows to drop.
These are some of my criteria:
I want to keep one random row within each group 'id' which has 'text' != NA (there might be several such rows), and I also want to keep all columns of this row; the same applies to all following criteria.
If all rows in a group have 'text' == NA, then I want to keep one random row which has the variable 'check' == T (there might be several such rows)
If all rows in a group have 'text' == NA and 'check' == F, then I want to keep the row which has the variable 'newtext' which meets the condition !(grepl("None",df$newtext))
I can also provide a dataset if this makes it more clear. However, my most important issue is that I do not know how to implement this logic of dropping rows according to an ordered number of criteria.
It would be nice, if anyone can tell me how to implement such a code.
Thank you!
This would be an example dataset:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 text = c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
                 check = c(T,F,T,T,T,F,F,F,F),
                 newtext = c("as","as","as","das","das","None","qwe","qwe2","None"),
                 othervars = c(1,2,3,45,5,6,6,7,1))
As an output, I want to keep the following rows:
row 1 or 3
row 4 or 5
row 7 or 8
The column othervars should be kept as well as I need this information later on.
Hope this makes it a bit clearer.
Alright, I've got something. I'm using filter() from dplyr for the subsetting because of the NA values; I ran into problems using either subset() or plain df[, ] subsetting from base R.
Data:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 text = c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
                 check = c(T,F,T,T,T,F,F,F,F),
                 newtext = c("as","as","as","das","das","None","qwe","qwe2","None"),
                 othervars = c(1,2,3,45,5,6,6,7,1))
Initiating new empty dataframe:
df2 <- df[0,]
Loop to sample one row per id:
library(dplyr)
for(i in unique(df$id)){
  temp <- filter(df, id == i)
  if(nrow(filter(temp, !is.na(text))) > 0){
    temp <- filter(temp, !is.na(text))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  } else if(nrow(filter(temp, check)) > 0){
    temp <- filter(temp, check)
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  } else {
    temp <- filter(temp, !grepl("None", newtext))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  }
}
Output example:
> df2
  id text check newtext othervars
2  1  asd  TRUE      as         1
1  2 <NA>  TRUE     das        45
3  3 <NA> FALSE     qwe         6
Greetings.
Edit: Ignore the row numbers on the left, they are residuals from the different subsets within the loop.
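For what it's worth, the same priority logic can also be written without an explicit loop using dplyr's grouped verbs (slice_sample needs dplyr >= 1.0.0). This is only a sketch of an alternative, not a rewrite of the answer above:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(priority = case_when(!is.na(text) ~ 1,              # criterion 1
                              check ~ 2,                     # criterion 2
                              !grepl("None", newtext) ~ 3,   # criterion 3
                              TRUE ~ 4)) %>%
  filter(priority == min(priority)) %>%  # keep only rows matching the best criterion per id
  slice_sample(n = 1) %>%                # one random row per id
  select(-priority) %>%
  ungroup()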

dynamic subsetting in r

I have a data set that is something like the following, but with many more columns and rows:
a<-c("Fred","John","Mindy","Mike","Sally","Fred","Alex","Sam")
b<-c("M","M","F","M","F","M","M","F")
c<-c(40,35,25,50,25,40,35,40)
d<-c(9,7,8,10,10,9,5,8)
df<-data.frame(a,b,c,d)
colnames(df)<-c("Name", "Gender", "Age", "Score")
I need to create a function that will let me sum the scores for selected subsets of the data. However, the subsets selected may have different numbers of variables each time. One subset could be Name=="Fred" and another could be Gender == "M" & Age == 40. In my actual data set, there could be up to 20 columns used in a selected subset, so I need to make this as general as possible.
I tried using an sapply call that included eval(parse(text=...)), but it takes a long time with a sample of only 20,000 or so records. I'm sure there's a much faster way, and I'd appreciate any help in finding it.
There are several ways to represent these two variables. One way is as two distinct objects, another is as two elements in a list.
However, using a named list might be the easiest:
# df is a function for the F distribution. Avoid using "df" as a variable name
DF <- df
example1 <- list(Name = c("Fred")) # c() not needed, used for emphasis
example2 <- list(Gender = c("M"), Age=c(40, 50))
## notice that the key portion is `DF[[nm]] %in% ll[[nm]]`
subByNmList <- function(ll, DF, colsToSum = c("Score")) {
  ret <- vector("list", length(ll))
  names(ret) <- names(ll)
  for (nm in names(ll))
    ret[[nm]] <- colSums(DF[DF[[nm]] %in% ll[[nm]], colsToSum, drop = FALSE])
  # optional: simplify to a plain value when only one condition was supplied
  if (length(ret) == 1)
    return(unlist(ret, use.names = FALSE))
  return(ret)
}
subByNmList(example1, DF)
subByNmList(example2, DF)
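With the DF built from the sample data above, these calls should return roughly:
subByNmList(example1, DF)
#[1] 18    (the two "Fred" rows, Score 9 each)
subByNmList(example2, DF)
#$Gender sums to 40, $Age sums to 36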
lapply( subset( df, Gender == "M" & Age == 40, select=Score), sum)
#$Score
#[1] 18
I could have written just:
sum( subset( df, Gender == "M" & Age == 40, select=Score) )
But that would not generalize very well.
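One way to generalise it is to build the row filter from a named list of allowed values. This is only a sketch; conditions is a hypothetical helper object, not something from the question:
# every column named in `conditions` must match one of its allowed values
conditions <- list(Gender = "M", Age = 40)
keep <- Reduce(`&`, Map(function(col, vals) df[[col]] %in% vals,
                        names(conditions), conditions))
sum(df$Score[keep])
#[1] 18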
