I currently have a dataset with more than 186k observations (rows), shown in figure 1. Each row belongs to a company identified by the BVDID column, and every company should have data for all years from 2013 to 2017.
missingdata <- series %>% filter(LIABILITIES == 0) %>% select(BVDID)
However, using the code above I found 87k rows containing only zero values, collected in the missingdata object.
How do I delete the rows of the series object whose BVDID (company code) appears in the missingdata dataframe? Also, how can I sort the years in ascending order within each company code, so they look better under str(series)?
Best regards
There are many ways; here is one.
Use dplyr's anti_join() function, which works like the set operation A - B: it removes from the first dataframe every row that has a match in the second.
series %>% anti_join(missingdata, by = "BVDID")
Or directly: LIABILITIES == 0 returns logical values; within each company, any() checks whether at least one row is zero, and negating it keeps only the companies with no zero rows at all.
series %>% group_by(BVDID) %>% filter(!any(LIABILITIES == 0)) %>% ungroup()
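A minimal, self-contained sketch of the anti_join() approach, using made-up data (two companies, three years each) rather than the real 186k-row dataset:

```r
library(dplyr)

# Made-up data: company "B" has one zero-LIABILITIES row and should be dropped
series <- tibble(BVDID = rep(c("A", "B"), each = 3),
                 year = rep(2013:2015, 2),
                 LIABILITIES = c(10, 12, 14, 9, 0, 11))

missingdata <- series %>% filter(LIABILITIES == 0) %>% select(BVDID)

# anti_join() keeps only the rows whose BVDID does NOT appear in missingdata;
# arrange() then sorts by company code and year
cleaned <- series %>%
  anti_join(missingdata, by = "BVDID") %>%
  arrange(BVDID, year)
```

Only company "A" survives, with its years in ascending order.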
series %>%
# filter out the BVDIDs from missingdata
filter(!BVDID %in% pull(missingdata)) %>%
# order the df
arrange(BVDID, year)
Consider a set of time series having the same length. Some have missing data in the end, due to the product being out of stock, or due to delisting.
If a series ends with at least four missing observations (in my case the missing values are 0, not NA), I consider it delisted.
In my time-series panel, I want to separate the series with delisted ids from the others and create two different dataframes based on this split.
I created a simple reprex to illustrate the problem:
library(tidyverse)
library(lubridate)
data <- tibble(id = as.factor(c(rep("1",24),rep("2",24))),
date = rep(c(ymd("2013-01-01")+ months(0:23)),2),
value = c(c(rep(1,17),0,0,0,0,2,2,3), c(rep(9,20),0,0,0,0))
)
I am searching for a pipeable tidyverse solution.
Here is one possibility to find the delisted ids:
data %>%
group_by(id) %>%
mutate(delisted = all(value[(n()- 3):n()] == 0)) %>%
group_by(delisted) %>%
group_split()
In the end I use group_split to split the data into two parts: one containing the delisted ids and the other containing the non-delisted ids.
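A variation on the answer above: using base split() instead of group_split() gives the two pieces convenient names, since split() on the logical column labels the list elements "FALSE" and "TRUE". This sketch reuses the reprex data from the question:

```r
library(dplyr)
library(lubridate)

# The reprex data: id 2 ends with four zeros, id 1 does not
data <- tibble(id = as.factor(c(rep("1", 24), rep("2", 24))),
               date = rep(ymd("2013-01-01") + months(0:23), 2),
               value = c(c(rep(1, 17), 0, 0, 0, 0, 2, 2, 3),
                         c(rep(9, 20), 0, 0, 0, 0)))

parts <- data %>%
  group_by(id) %>%
  mutate(delisted = all(value[(n() - 3):n()] == 0)) %>%
  ungroup() %>%
  split(.$delisted)            # names the pieces "FALSE" and "TRUE"

active   <- parts[["FALSE"]]   # non-delisted ids
delisted <- parts[["TRUE"]]    # delisted ids
```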
I'm learning to work with R and I would need some help figuring out how to create a dataframe from averages of subsets of my initial dataframe, based on a condition.
I have a df of ~18000 rows, 9 columns, one being a distance. I want to use conditions on the distance to average the values of the 9 columns. The first subset would correspond to a distance range of 0:2.5, the second one a range of 2.5:5, and so on, every 2.5 meters.
I can make a first subset this way:
df1 <- subset(df_ini, df_ini$Distance..m.>0 & df_ini$Distance..m.<2.5)
The new dataframe now has 18 rows.
I then need to average the values of each column, store them in a new df and continue to do so for every subset, appending the averages to the same df.
I can't get the right loop(s) to do that, I would really appreciate any ideas/tips.
Thanks!
I am pretty new to R, but maybe this will give some inspiration. I suggest that you take a look at the dplyr package.
For mtcars:
library(dplyr)
df1 <- mtcars %>% filter(mpg >= 10 & mpg <= 20) %>% summarise_all(mean)
df2 <- mtcars %>% filter(mpg >= 20 & mpg <= 30) %>% summarise_all(mean)
combined <- rbind(df1, df2)
In your dataset you can filter on ranges of distances. Ideally, you would use a loop to automatically create the range groups (0:2.5, 2.5:5.0, etc.), like you stated in your post. I don't know how to do that, though.
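One way to build those groups automatically, without a loop, is cut(), which assigns each row to a 2.5 m bin. The data below are made up; only the column name Distance..m. is taken from the question, and across() assumes dplyr >= 1.0. A sketch:

```r
library(dplyr)

# Hypothetical stand-in for df_ini: a distance column plus two measurement columns
set.seed(1)
df_ini <- data.frame(Distance..m. = runif(100, min = 0, max = 10),
                     a = rnorm(100),
                     b = rnorm(100))

# cut() puts each row into a 2.5 m bin; one summarise() then averages every column
binned <- df_ini %>%
  mutate(bin = cut(Distance..m., breaks = seq(0, 10, by = 2.5))) %>%
  group_by(bin) %>%
  summarise(across(everything(), mean))
```

Changing the by argument of seq() changes the bin width, so no code has to be repeated per subset.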
I want to aggregate my data. The goal is to have one point in a diagram for each time interval. I have a data frame with two columns: the first is a timestamp and the second is a value. I want to sum all the values that fall within each time period, for example one second.
I don't know how to do this with the aggregate function, because it has no direct support for time.
0.000180 8
0.000185 8
0.000474 32
It is not easy to tell from your question what you're specifically trying to do. Your data has no column headings, we do not know the data types, you did not include the error message, and you contradicted yourself between your original question and your comment (is the first column the time stamp, or the second?).
I'm trying to understand. Are you trying to:
Split your original data.frame into multiple data.frames?
View a specific sub-set of your data? Effectively, you want to filter your data?
Group your data.frame in to specific increments of a set time-interval to then aggregate the results?
Assuming that you have named the variables on your dataframe as time and value, I've addressed these three examples below.
#Set Data
num <- 100
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180,0.000500,0.000005),num,TRUE),
value = sample(1:100,num,TRUE))
#Example 1: Split your data into multiple dataframes (using base functions)
temp1 <- tempdf[ tempdf$time>0.0003 , ]
temp2 <- tempdf[ tempdf$time>0.0003 & tempdf$time<0.0004 , ]
#Example 2: Filter your data (using dplyr::filter() function)
dplyr::filter(tempdf, time>0.0003 & time<0.0004)
#Example 3: Chain the functions together using dplyr to group and summarise your data
library(dplyr)
tempdf %>%
mutate(group = floor(time*10000)/10000) %>%
group_by(group) %>%
summarise(avg = mean(value),
num = n())
I hope that helps!
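Since the question specifically mentions the aggregate function, the same binning trick from Example 3 also works in base R: create the bin column first, then aggregate by it. This sketch reuses the simulated tempdf from above and sums the values per bin, since the question asks for the values to be added together:

```r
# Base-R version: bin the timestamps, then aggregate() by bin
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180, 0.000500, 0.000005), 100, TRUE),
                     value = sample(1:100, 100, TRUE))

# floor() snaps each timestamp to a 0.0001-second grid
tempdf$group <- floor(tempdf$time * 10000) / 10000

# One row per bin, with the values summed within each bin
result <- aggregate(value ~ group, data = tempdf, FUN = sum)
```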
I have some data (download link: http://spreadsheets.google.com/pub?key=0AkBd6lyS3EmpdFp2OENYMUVKWnY1dkJLRXAtYnI3UVE&output=xls) that I'm trying to filter. I had reconfigured the data so that instead of one row per country, and one column per year, each row of the data frame is a country-year combination (i.e. Afghanistan, 1960, NA).
Now that I've done that, I want to create a subset of the initial data that excludes any country that has 10+ years of missing contraceptive use data.
I had thought to create a list of the unique country names in a second data frame, and then add a variable to that frame that holds the # of rows for each country that have an NA for contraceptive use (i.e. for Afghanistan it would have 46). My first thought (being most fluent in VB.net) was to use a for loop to iterate through the countries, get the NA count for that country, and then update the second data frame with that value.
In that vein I tried the following:
for(x in cl){
  x$rc = nrow(subset(BCU, BCU$Country == x$Country))
}
After that failed, a little more Googling brought me to a question on here (forgot to grab the link) that suggested using by(). Based on that I tried:
by(cl, 1:nrow(cl), cl$rc <- nrow(subset(BCU, BCU$Country == cl$Country
& BCU$Contraceptive_Use == "NA")))
(cl is the second data frame listing the country names, and BCU is the initial contraceptive use data frame)
I'm fairly new to R (the problem I'm working is for an R course on Udacity), so I'll freely admit this may not be the best approach, but I'm still curious how to do this sort of aggregation.
They all seem to have >= 10 years of missing data (unless I miscalculated somewhere):
library(tidyr)
library(dplyr)
dat <- read.csv("contraceptive use.csv", stringsAsFactors=FALSE, check.names=FALSE)
dat <- rename(gather(dat, year, value, -1),
country=`Contraceptive prevalence (% of women ages 15-49)`)
dat %>%
group_by(country) %>%
summarise(missing_count=sum(is.na(value))) %>%
arrange(desc(missing_count)) -> missing
sum(missing$missing_count >= 10)
## [1] 213
length(unique(dat$country))
## [1] 213
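To actually build the subset the question asks for (excluding countries with 10 or more missing years), you can filter the per-country counts and semi_join() back onto the long data. Since the linked file may not be available, this sketch uses two made-up countries:

```r
library(dplyr)

# Made-up stand-in for the reshaped country-year data
dat <- tibble(country = rep(c("Afghanistan", "Norway"), each = 15),
              year = rep(1960:1974, 2),
              value = c(rep(NA, 12), 31, 33, 35,   # 12 missing years
                        rep(NA, 3), 60:71))        # 3 missing years

# Countries with fewer than 10 missing years survive the filter
keep <- dat %>%
  group_by(country) %>%
  summarise(missing_count = sum(is.na(value))) %>%
  filter(missing_count < 10)

# semi_join() keeps only the rows of dat whose country appears in keep
subset_dat <- dat %>% semi_join(keep, by = "country")
```

No explicit loop is needed: group_by() plus summarise() replaces the per-country iteration from the VB.net-style attempt.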
I am new to R language.
I have a dataset, the first column is the name and second is the value.
I want to read each value in the second column and check if the value falls into a certain range.
For example,
Name value
AA 123
and the range in (100,150)
which function can be used?
Thank you in advance.
You can use this code, where df is your dataset:
for(i in 1:nrow(df)){
  if(df[i, 2] > 100 & df[i, 2] < 150){
    print(df[i, ])
  }
}
It is not entirely clear what the desired result is.
Option 1 could be to create a new column that records (e.g. TRUE/FALSE) whether each observation (row) is within the limits. For example, with tidyverse-style code, to create a column called "is_valid":
df %>% mutate(is_valid = ...)
Option 2 could be to filter the data, creating a subset that keeps only the desired observations, e.g. with tidyverse-style code:
df %>% filter(value <= 150) %>% filter(value >= 100)
In any case, I'd recommend spending some time learning how to do this in a tidyverse way with dplyr (google the free online book R for Data Science).
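A runnable sketch of both options, assuming the columns are called Name and value as in the question and using the strict bounds from the loop answer above:

```r
library(dplyr)

# Hypothetical data with the question's column names
df <- data.frame(Name = c("AA", "BB", "CC"),
                 value = c(123, 90, 160))

# Option 1: flag each row with TRUE/FALSE
flagged <- df %>% mutate(is_valid = value > 100 & value < 150)

# Option 2: keep only the rows inside the range (vectorised, no loop needed)
in_range <- df %>% filter(value > 100, value < 150)
```

Both work on the whole column at once, which is usually preferred in R over row-by-row loops.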