Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.
> mydata
id name marks gender
1 a1 56 female
2 a2 37 male
I want to divide the students into groups based on their marks, so that the marks within each group differ by no more than 10. I tried the function table, which gives the number of students in each range (say 20-30, 30-40), but I want it to pick the students whose marks fall in a given range and put all their information together in a group. Any help is appreciated.
I am not sure what you mean by "put all their information together in a group", but here is a way to split your original data frame into a list of data frames, where each element holds the students within a mark range of 10:
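mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))
# include.lowest = TRUE keeps a mark of exactly 20 in the first group;
# without it, cut() returns NA for 20 and split() drops that row
split(mydata, cut(mydata$marks, seq(20, 100, by = 10), include.lowest = TRUE))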
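To make the shape of the result concrete, here is a minimal sketch of how the list returned by split() can be inspected. The seed and the variable name groups are assumptions made for illustration, and include.lowest = TRUE is used so that a mark of exactly 20 lands in the first group.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))

# One data frame per 10-mark band
groups <- split(mydata,
                cut(mydata$marks, seq(20, 100, by = 10), include.lowest = TRUE))

names(groups)         # band labels, e.g. "[20,30]", "(30,40]", ...
sapply(groups, nrow)  # number of students per band
groups[["(30,40]"]]   # all columns for students with marks in (30, 40]
```

Each list element keeps every column of the original data frame, so the id, name and gender travel with the marks.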
I think that @Sacha's answer should suffice for what you need to do, even if you have more than one set.
You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).
So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.
First, here's some sample data.
# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20,
                      name = paste("a", 1:20, sep = ""),
                      marks = sample(20:100, 20, replace = TRUE),
                      gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
                      name = paste("b", 1:17, sep = ""),
                      marks = sample(30:100, 17, replace = TRUE),
                      gender = sample(c("F", "M"), 17, replace = TRUE))
Second, different options for "grouping".
Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) x[x$marks >= 30 & x$marks <= 50, ])
Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, creates four groups. For this example, you'll end up with a nested list with two list items, each with two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, x$marks >= 30 & x$marks <= 50))
Option 3: More flexible than the first. This is essentially @Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
       function(x) split(x, cut(x$marks,
                                breaks = c(0, 30, 50, 75, 100),
                                include.lowest = TRUE)))
Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.
# Combine the data. Assumes the column names are the same in both sets
myDataALL <- rbind(myData1, myData2)
# Extract just the group of scores you're interested in
myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.
split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
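If a single data frame is preferred over a list, one further possibility is to store the band as an extra column. This is only a sketch: the small hand-made data set and the column name band are assumptions for illustration, and the breaks are the ones from Option 3.

```r
# Small stand-in for the combined data (values chosen by hand)
myDataALL <- data.frame(
  name = c("a1", "a2", "a3", "b1", "b2", "b3"),
  marks = c(56, 37, 92, 30, 75, 48),
  stringsAsFactors = FALSE)

# Label every row with its mark band instead of splitting the data
myDataALL$band <- cut(myDataALL$marks,
                      breaks = c(0, 30, 50, 75, 100),
                      include.lowest = TRUE)

table(myDataALL$band)                                  # students per band
aggregate(marks ~ band, data = myDataALL, FUN = mean)  # mean mark per band
myDataALL[order(myDataALL$band, myDataALL$marks), ]    # rows grouped by band
```

The band column can later be passed to split() if a list is needed after all.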
I hope one of these options serves your needs!
I had the same kind of issue, and after researching some answers on Stack Overflow I came up with the following solution:
Step 1 : Define range
Step 2 : Find the elements that fall in the range
Step 3 : Plot
A sample code is as shown below:
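# Step 1: bin edges every 2000 units, from 0 up to the maximum
# (named 'breaks' to avoid masking base::range)
breaks <- seq(0, max(all$downlink), by = 2000)
# Step 2: count the values in each half-open bin [breaks[i], breaks[i + 1]);
# values at or above the last edge are not counted
counts <- numeric(length(breaks) - 1)
for (i in seq_along(counts)) {
  counts[i] <- sum(all$downlink >= breaks[i] & all$downlink < breaks[i + 1])
}
# Step 3: plot, rounding the y-axis limit up so the tallest bar is not clipped
countmax <- max(counts)
a <- ceiling(countmax / 1000) * 1000
barplot(counts, col = rainbow(16), ylim = c(0, a))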
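The same three steps can also be sketched without an explicit loop by using cut() and table(). The simulated downlink vector below is an assumption standing in for all$downlink, and the bin edges are extended one step past the maximum so that no value falls outside the last bin.

```r
set.seed(1)
downlink <- runif(500, min = 0, max = 20000)  # simulated stand-in data

# Step 1: bin edges every 2000 units, extended past the maximum
breaks <- seq(0, max(downlink) + 2000, by = 2000)

# Step 2: count values per half-open bin [lower, upper)
counts <- table(cut(downlink, breaks = breaks, right = FALSE))

# Step 3: plot the counts
barplot(counts, col = rainbow(length(counts)), las = 2)
```

Because the edges now cover the whole range, the counts add up to the number of observations.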
My data is imported into R as a list of 60 tibbles, each with 13 columns and 8 rows. I want to detect outliers, defined as values more than 2*sd away from the mean, by comparing each value in column "2" to the mean of all values of column "2" in the same row across the 60 tibbles.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
You haven't added an example of your code, so I've made a quick and simple example to demonstrate my answer. I think the logic would be much more straightforward if you first combine the list of tibbles into a single tibble. This allows you to do everything you want in a simple dplyr pipe, ultimately identifying outliers by 1's in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
colB = seq(0.1,2.1,0.1),
id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
colB = seq(21,41,1),
id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
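The question also asks to replace each outlier with the mean of the values in the same position. A minimal sketch of that step follows, assuming eight small hand-made tibbles with one planted outlier (with only a handful of tables, a single value can never be more than 2 standard deviations from the mean, so a few more are needed). All names here are illustrative.

```r
library(dplyr)

# Eight small tables; row 2 of table 3 gets an obvious outlier
tibbleList <- lapply(1:8, function(k) tibble(id = 1:3, colA = c(10, 20, 30) + k))
tibbleList[[3]]$colA[2] <- 500

cleaned <- bind_rows(tibbleList, .id = "tbl") %>%
  group_by(id) %>%                                # same position across tables
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         outlier = abs(colA - meanA) > 2 * sdA,
         colA = ifelse(outlier, meanA, colA)) %>% # replace flagged values only
  ungroup()

# Back to a list of tables, one per original tibble
cleanedList <- split(select(cleaned, id, colA), cleaned$tbl)
```

Note that the mean used for the replacement still includes the outlier itself, which matches the wording of the question; excluding it would need a separate calculation.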
My question may seem trivial, but I cannot find a solution.
I have a list containing 5 dataframes named by stock (my real list contains 250 stocks).
Each dataframe has the same number and name of columns:
Date: time series formatted as character
returns: time series of stock returns
DJSTOXX: time series of index returns
EPS: Earnings announcement dates, which contains also NAs, formatted as characters.
For each dataframe I would like to run the evReturn function from the erer package, which returns a list with different measures regarding cumulative abnormal returns around certain dates. The main input of the function is y, which should be a dataframe containing a column of dates (time series), a column of returns, a column of index returns and a column of earnings announcement dates. As you can see below, this code runs the event analysis for the firm NESN for each date on which an announcement has been made: for each date in the time series that equals an EPS date (omitting NAs), it calls evReturn. It saves the output in hh3, a list holding the computed event return for each EPS date.
hh3 <- list()
for (i in na.omit(dflist$NESN$Date[dflist$NESN$Date == dflist$NESN$EPS])){
hh3[[i]] <- evReturn(y = dflist$NESN, firm = "return", event.date = i, y.date = "Date",
index = "DJSTOXX", event.win = 2, est.win = 50, digits = 3)
}
Now my question is:
How can I set this up to run the code for each stock in the list? I am expecting the resulting list to contain 5 lists (1 for each stock), each with a list for each EPS date.
As you can see in the evReturn function I explicitly set $NESN but I want to set it as for each dataframe in dflist do the function.
I have tried with lapply the following:
lapply(dflist, function(x){
for (i in na.omit(x$Date[x$Date == x$EPS])){
dflist2[[i]] <- evReturn(y = x, firm = "return", event.date = i, y.date = "Date",
index = "DJSTOXX", event.win = 2, est.win = 50, digits = 3)
}})
But it returns:
Error in xj[i] : only 0's may be mixed with negative subscripts
I thus tried to nest 2 loops as:
for(j in seq_along(dflist)){
for (i in na.omit(dflist$j$Date[dflist$j$Date == dflist$j$EPS])){
hh7[[j]][i] <- evReturn(y = dflist$j, firm = "return", event.date = i, y.date = "Date",
index = "DJSTOXX", event.win = 2, est.win = 50, digits = 3)
}}
But it returns hh7 as a list of length 0.
Any help is highly appreciated since it seems I am missing something.
Thank you
Hard to be sure how to help because you don't provide any sample data, but you should be able to do something like this:
get_ev_return <- function(d) {
  # Dates on which an announcement was made; rows where EPS is NA are dropped
  dates <- na.omit(d[d$Date == d$EPS, "Date"])
  lapply(dates, \(date) {
    evReturn(y = d, firm = "return", event.date = date, y.date = "Date",
             index = "DJSTOXX", event.win = 2, est.win = 50, digits = 3)
  })
}
lapply(dflist, get_ev_return)
Before running the lapply(), test get_ev_return() on NESN and see if it works for one frame. Also, check that dates within get_ev_return() actually contains any dates. Maybe you have one or more data.frames with no rows where Date == EPS. Again, this is the problem with not providing any sample data in your question.
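To show the nested structure without needing the erer package, here is a sketch that uses a hypothetical stand-in function in place of evReturn. The two tiny data frames, the stock name ROG, and the helpers event_dates, run_events and stub_event are all assumptions for illustration. As a side note, dflist$j in the nested loop from the question looks up an element literally named "j", which is why nothing was stored; dflist[[j]] is the form that uses the loop index.

```r
# Tiny stand-in data: two stocks with character dates, as in the question
dflist <- list(
  NESN = data.frame(Date = c("2020-01-01", "2020-01-02", "2020-01-03"),
                    EPS  = c(NA, "2020-01-02", NA),
                    stringsAsFactors = FALSE),
  ROG  = data.frame(Date = c("2020-01-01", "2020-01-02", "2020-01-03"),
                    EPS  = c("2020-01-01", NA, "2020-01-03"),
                    stringsAsFactors = FALSE))

# which() drops the NA comparisons, leaving only real announcement dates
event_dates <- function(d) d$Date[which(d$Date == d$EPS)]

# Apply f to every announcement date of one stock; name results by date
run_events <- function(d, f) {
  dates <- event_dates(d)
  setNames(lapply(dates, function(dt) f(d, dt)), dates)
}

# Hypothetical stand-in for evReturn, used only to show the result shape
stub_event <- function(d, date) paste("event analysis for", date)

res <- lapply(dflist, run_events, f = stub_event)
str(res)  # one list per stock, one entry per announcement date
```

Swapping stub_event for a wrapper around evReturn should give the list-of-lists layout described in the question, provided evReturn works for a single stock and date.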
I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,]
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll probably need to modify to ensure it matches the exact column names and your comparison condition.
Key ideas:
Add a filler column so we get a complete cross-join.
Mutate a new indicator column for whether the two-letter pair is contained in the four-letter combination.
Sum the indicators for each two-letter pair.
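The key ideas above can be sketched in base R as follows. This is not the dplyr code from the answer: merge(..., by = NULL) is swapped in for the dummy-column join, combn() stands in for gtools::combinations() (it produces the same combinations in the same order for sorted input), and the names four, two, P1, P2 and contains_pair are assumptions.

```r
v1 <- LETTERS[1:10]

# All 4-letter and 2-letter combinations (base-R stand-in for gtools)
four <- as.data.frame(t(combn(v1, 4)), stringsAsFactors = FALSE)
four$group <- rep(1:15, each = 14)
two <- as.data.frame(t(combn(v1, 2)), stringsAsFactors = FALSE)
names(two) <- c("P1", "P2")  # avoid clashing with V1, V2 in 'four'

# Cross join: every pair against every 4-letter combination
crossed <- merge(two, four, by = NULL)

# Indicator: both letters of the pair occur somewhere in the combination
quad <- as.matrix(crossed[, c("V1", "V2", "V3", "V4")])
crossed$contains_pair <- rowSums(quad == crossed$P1) > 0 &
                         rowSums(quad == crossed$P2) > 0

# Frequency of each pair within each group
freq <- aggregate(contains_pair ~ P1 + P2 + group, data = crossed, FUN = sum)
head(freq)
```

The result has one row per pair and group, so pairs that never occur in a group show up with a frequency of 0 rather than being dropped.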
I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values, the first columns are all string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns are either a header like a percentage change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract data for each category. So for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in my first column. Is there a way to do this in either base R or dplyr easily?
eg.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1,2,3,4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
return(df[start:(end-1),])
}
Then loop over the indices in
dividers <- which(df$column1 %in% keywords)
And save the function outputs however one would like.
lapply(seq_len(length(dividers) - 1), function(i) in_between(df, start = dividers[i], end = dividers[i + 1]))
This works. Messy data so I still have the annoying case where I need to keep the offset rows.
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
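library(dplyr)  # needed for %>% and mutate()
set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)
start <- c("C", "E", "F")
# Flag the rows that start a new block, then number the blocks cumulatively
df2 <- df %>%
  mutate(start = x %in% start,
         group = cumsum(start))
split(df2, df2$group)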
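Applied to the layout described in the question, the same cumsum() idea might look like the sketch below. The sample rows, the value column and the keywords vector are made up for illustration, and the sketch assumes each keyword appears only once in column1.

```r
# Made-up rows in the spirit of the question's layout
df <- data.frame(
  column1 = c("random", "credit", "AAA", "BBB", "FX", "EUR", "USD"),
  value   = c(NA, NA, 1, 2, NA, 3, 4),
  stringsAsFactors = FALSE)
keywords <- c("credit", "FX")

# Section number increases by one at every keyword row (0 = before the first)
section <- cumsum(df$column1 %in% keywords)

# One data frame per section, dropping the rows before the first keyword
blocks <- split(df[section > 0, ], section[section > 0])
names(blocks) <- df$column1[df$column1 %in% keywords]

blocks$credit  # rows from "credit" up to, but not including, "FX"
```

Unlike the start:(end-1) approach, this also returns the last section, which runs to the end of the data.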
I want to replace certain values in a data frame column with values from a lookup table. I have the values in a list, stuff.kv, and many values are stored in the list (but some may not be).
stuff.kv <- list()
stuff.kv[["one"]] <- "thing"
stuff.kv[["two"]] <- "another"
#etc
I have a dataframe, df, which has multiple columns (say 20), with assorted names. I want to replace the contents of the column named 'stuff' with values from 'lookup'.
I have tried building various apply methods, but nothing has worked.
I built a function, which process a list of items and returns the mutated list,
stuff.lookup <- function(x) {
for( n in 1:length(x) ) {
if( !is.null( stuff.kv[[x[n]]] ) ) x[n] <- stuff.kv[[x[n]]]
}
return( x )
}
unlist(lapply(df$stuff, stuff.lookup))
The apply syntax is bedeviling me.
Since you made such a nice lookup table, you can just use it to change the values. No loops or apply needed.
## Sample Data
set.seed(1234)
DF = data.frame(stuff = sample(c("one", "two"), 8, replace=TRUE))
## Make the change
DF$stuff = unlist(stuff.kv[DF$stuff])
DF
stuff
1 thing
2 another
3 another
4 another
5 another
6 another
7 thing
8 thing
Below is a more general solution building on @G5W's answer, which doesn't cover the case where your original data frame has values that don't exist in the lookup table (which would result in a length mismatch error):
library(dplyr)
stuff.kv <- list(one = "another", two = "thing")
df <- tibble(
stuff = rep(c("one", "two", "three"), each = 3)
)
df <- df %>%
mutate(stuff = paste(stuff.kv[stuff]))
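One thing to watch with paste() on a list: as far as I can tell, keys missing from the lookup come back as the string "NULL" rather than keeping their original value. If unmatched values should stay untouched, a base-R sketch along these lines may help; the names lookup and matched are assumptions, and the lookup values follow the question.

```r
stuff.kv <- list(one = "thing", two = "another")
df <- data.frame(stuff = rep(c("one", "two", "three"), each = 3),
                 stringsAsFactors = FALSE)

lookup <- unlist(stuff.kv)           # named character vector
matched <- unname(lookup[df$stuff])  # NA where the key is not in the lookup

# Use the looked-up value where there is one, otherwise keep the original
df$stuff <- ifelse(is.na(matched), df$stuff, matched)
df
```

Here "three" has no entry in the lookup, so those rows keep the value "three".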