I have ~4000 observations in my data frame, test_11, and have pasted part of the data frame below:
(data frame snippet omitted)
The k_hidp column identifies matching households, the k_fihhmnnet1_dv column is their reported household income, and the percentage_income_rounded column reports each participant's contribution to the total household income.
I want to filter my data to remove all k_hidp observations where their collective income in the percentage_income_rounded does not equal 100.
So for example, the first household, 68632420, reported a contribution of 83% (65 + 18) instead of the 100% that the other households report.
Is there any way to remove these household observations so I am only left with households with a collective income of 100%?
Thank you!
Try this:
## Creating the dataframe
df <- data.frame(k_hidp = c(68632420, 68632420, 68632420, 68632420, 68632420, 68632420,
                            68632422, 68632422, 68632422, 68632422, 68632428, 68632428),
                 percentage_income_rounded = c(65, 18, 86, 14, 49, 51, 25, 25, 25, 25, 50, 50))
## Loading the libraries
library(dplyr)
## Aggregating and determining which household collective income is 100%
df1 <- df %>%
  group_by(k_hidp) %>%
  mutate(TotalPercentage = sum(percentage_income_rounded)) %>%
  filter(TotalPercentage == 100)
Output
> df1
# A tibble: 6 x 3
# Groups: k_hidp [2]
k_hidp percentage_income_rounded TotalPercentage
<dbl> <dbl> <dbl>
1 68632422 25 100
2 68632422 25 100
3 68632422 25 100
4 68632422 25 100
5 68632428 50 100
6 68632428 50 100
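If you don't need to keep the TotalPercentage column, the same result can be had with a minimal variant of the pipeline above, filtering on the group sum directly:
df1 <- df %>%
  group_by(k_hidp) %>%
  filter(sum(percentage_income_rounded) == 100) %>%  # keep only households whose contributions total 100%
  ungroup()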
I have a dataset similar to the below:
Area     2020    2021    2022
AreaA    4,000   6,000   8,000
AreaB    5,000   7,000   9,000
I'm looking to amend the dataset to predict values for AreaA and AreaB for 2023 based on the three previous years. Can anyone please advise? If more data points are required for validity I can add any number of additional points, but if three suffice then that would be ideal. Thank you!
There are a few ways to do this, and it really depends on what pattern you expect the data to follow. In your example, it looks as though the trend is linear, so you might simply want to get predictions from a linear model.
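First, a reproducible version of the example data; the year column names here are an assumption based on the table above:
df <- data.frame(Area = c('AreaA', 'AreaB'),
                 `2020` = c(4000, 5000),
                 `2021` = c(6000, 7000),
                 `2022` = c(8000, 9000),
                 check.names = FALSE)  # keep the year columns named '2020', '2021', '2022'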
To do this, it would be far easier if you put your data in a tidy format (that is, one row for each observation, with a column for years and a column for values). We can do that as follows:
library(tidyverse)
df_long <- df %>%
  pivot_longer(-Area, names_to = 'Year') %>%
  mutate(Year = as.numeric(Year))
Our data now looks like this:
df_long
#> # A tibble: 6 x 3
#> Area Year value
#> <chr> <dbl> <dbl>
#> 1 AreaA 2020 4000
#> 2 AreaA 2021 6000
#> 3 AreaA 2022 8000
#> 4 AreaB 2020 5000
#> 5 AreaB 2021 7000
#> 6 AreaB 2022 9000
Now we can fit a linear regression of value on Year and Area, with an interaction so that each area gets its own intercept and slope:
model <- lm(value ~ Area * Year, data = df_long)
Using this model, we can get predictions for each area in the next two years by simply creating a data frame of the desired years and areas, then plugging this into predict along with our model.
newdata <- data.frame(Area = rep(c('AreaA', 'AreaB'), 2),
                      Year = rep(2023:2024, each = 2))
newdata$value <- predict(model, newdata = newdata)
Assuming you want this put back into the original format, we just pivot from long format back to wide format:
pivot_wider(bind_rows(df_long, newdata), names_from = Year,
            values_from = value)
#> # A tibble: 2 x 6
#> Area `2020` `2021` `2022` `2023` `2024`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AreaA 4000 6000 8000 10000. 12000.
#> 2 AreaB 5000 7000 9000 11000. 13000
Reassuringly, we can see that this continues the pattern for each series of increasing by 2,000 every year.
If you don't expect your data to change linearly with time, then we would need to know what pattern you expect before being able to advise you.
Created on 2023-01-24 with reprex v2.0.2
I have a dataset that looks a bit like this:
Income   Income period
1500     3
400      2
30000    1
Where 1 is yearly, 2 is weekly, and 3 is monthly.
I want to create a column that will show the income yearly for all rows so that I can compare them more easily.
Apologies if this is a very simple question. I guess I could recode 3 to be 12 and then multiply the columns together, then recode 2 to be 52 and do the same. I just wanted to see if anyone has a better way of doing things, as there are actually multiple columns like this, with different codes for time periods, that I need to fix.
library(dplyr)
df %>%
  mutate(income_yr = case_when(period == 3 ~ income * 12,   # monthly to yearly
                               period == 2 ~ income * 52,   # weekly to yearly
                               TRUE ~ income))              # already yearly
#> income period income_yr
#> 1 1500 3 18000
#> 2 400 2 20800
#> 3 30000 1 30000
data
df <- data.frame(income = c(1500, 400, 30000),
                 period = c(3, 2, 1))
Created on 2021-04-13 by the reprex package (v2.0.0)
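Since the question mentions several columns coded this way, one way to generalise (a sketch, assuming the same code-to-multiplier mapping applies throughout) is a named lookup vector, which avoids repeating the case_when for every column:
# map each period code to its yearly multiplier (1 = yearly, 2 = weekly, 3 = monthly)
multiplier <- c(`1` = 1, `2` = 52, `3` = 12)

df %>%
  mutate(income_yr = income * unname(multiplier[as.character(period)]))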
I know how to work with and compute math/statistics on one data frame. But what happens when I have to deal with two? For example:
> df1
supervisor salesperson
1 Supervisor1 Matt
2 Supervisor2 Amelia
3 Supervisor2 Philip
> df2
month channel Matt Amelia Philip
1 Jan Internet 10 50 20
2 Jan Cellphone 20 60 30
3 Feb Internet 40 40 30
4 Feb Cellphone 30 120 40
How can I compute the sales by supervisor, grouped by channel, in an efficient and generalizable way? Is there any methodology or criteria to follow when you need to relate two or more data frames in order to compute the data you need?
PS: The numbers are the sales made by each salesperson.
Here is the idea: convert to long format and merge, using the tidyverse.
library(tidyverse)
df2 %>%
  gather(salesperson, val, -c(1:2)) %>%
  left_join(., df1, by = 'salesperson') %>%
  spread(salesperson, val, fill = 0) %>%
  group_by(channel, supervisor) %>%
  summarise_at(vars(names(.)[4:6]), sum)  # sum each salesperson column
which gives,
# A tibble: 4 x 5
# Groups: channel [?]
channel supervisor Amelia Matt Philip
<fct> <fct> <dbl> <dbl> <dbl>
1 Cellphone Supervisor1 0. 50. 0.
2 Cellphone Supervisor2 180. 0. 70.
3 Internet Supervisor1 0. 50. 0.
4 Internet Supervisor2 90. 0. 50.
NOTE: You can also add month to the group_by.
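If you prefer the newer tidyverse verbs (gather() and spread() are superseded by pivot_longer() and pivot_wider()), an equivalent pipeline, as a sketch producing the same table, would be:
df2 %>%
  pivot_longer(-c(month, channel), names_to = 'salesperson', values_to = 'val') %>%
  left_join(df1, by = 'salesperson') %>%
  group_by(channel, supervisor, salesperson) %>%
  summarise(val = sum(val), .groups = 'drop') %>%
  pivot_wider(names_from = salesperson, values_from = val, values_fill = 0)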
I feel like there is a simple solution for this but I'm stuck. The code below creates a simple list of two dataframes (they are the same for simplicity of the example, but the real data has different values)
library(tidyverse)

Loc <- c("Montreal", "Toronto", "Vancouver", "Quebec", "Ottawa", "Hamilton", "Total")
Count <- c("2344", "2322", "122", "45", "4544", "44", "9421")
Data <- data_frame(Loc, Count)
Data2 <- data_frame(Loc, Count)
Data3 <- list(Data, Data2)
Each dataframe has "Total" within the "Loc" column with the corresponding overall total of the "Count" column. I would like to calculate percentages for each dataframe by dividing each value in the "Count" column by the total, which is the last number in the "Count" column.
I would like the percentages to be added as new columns for each dataframe.
For this example, the total is the last number in the column, but in reality, it may be mixed anywhere in the column and can be found by the corresponding "Total" value in the "Loc" column.
I would like to use purrr and the tidyverse.
Below is an example of the code, but I'm stuck on the percentage...
Data3 %>% map(~ mutate(.x, paste0(round(100 * (MISSING PERCENTAGE), 2), "%")))
This solution uses only base R:
for (i in seq_along(Data3)) {
  Data3[[i]]$Count <- as.numeric(Data3[[i]]$Count)
  # look the total up by its "Total" label in Loc, so it works
  # wherever the total row happens to sit in the column
  total <- Data3[[i]]$Count[Data3[[i]]$Loc == "Total"]
  Data3[[i]]$perc <- Data3[[i]]$Count / total
}
> Data3
[[1]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
[[2]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
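And for the purrr/tidyverse version the question asked about, here is one way to finish the map() call from the question (a sketch; it also finds the total by its "Total" label in Loc, so it works wherever the total row sits):
library(tidyverse)

Data3 %>%
  map(~ .x %>%
        mutate(Count = as.numeric(Count),
               perc = paste0(round(100 * Count / Count[Loc == "Total"], 2), "%")))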
So I have a large data set, with many columns (10) and 100,000 rows. One of the columns is the date of observation, and there are two other corresponding columns, one for species and one for year. First, I want to create a new column giving the mean date of observation for each species in each year, computed over the first 10% of observations (for each species in each year). Second, I want to reduce the data set so that only the rows involved in that calculation (i.e. the first 10%) remain. Finally, it's important that my new data set keeps the other corresponding columns with information for each observation, e.g. the location, etc.
Sample of the data set (there do exist more columns):
date = c(3, 84, 98, 100, 34, 76, 86, ...)
species = c("blue", "purple", "grey", "purple", "green", "pink", "pink", "white", ...)
id = c(1, 2, 3, 2, 4, 5, 5, 6, ...)
year = c(1901, 2000, 1901, 1996, 1901, 2000, 1986, ...)
habitat = c("forest", "plain", "mountain", ...)
Ex: the first row says species blue was seen on Jan 3rd, 1901 in a forest.
OK, here's one approach using dplyr. This will get you the mean of a variable, by species and year, using the first 10% of observations for each grouping.
library(dplyr)

# test data set
test <- data.frame(species = c(rep("blue", 100), rep("purple", 100)),
                   year = rep(c(1901, 1902, 1903, 1904, 1905), 40),
                   value = rnorm(200),
                   stringsAsFactors = FALSE)
# checking data set
group_by(test, species, year) %>% summarise(n = n(), mean.value = mean(value))
# by species and year, identify first ten per cent of observations
test <- test %>%
  group_by(species, year) %>%
  mutate(nth.ob = seq_along(species),   # running position of each observation in its group
         n.obs = n(),                   # total observations in the group
         pc = round((nth.ob / n.obs * 100), 2)) %>%  # position as a percentage of the group
  arrange(species, year)  # sort for easy viewing
# and check
head(test)
Source: local data frame [6 x 6]
Groups: species, year
species year value nth.ob n.obs pc
1 blue 1901 -0.2839094 1 20 5
2 blue 1901 -1.7158035 2 20 10
3 blue 1901 1.1664650 3 20 15
4 blue 1901 -0.0935940 4 20 20
5 blue 1901 -0.1199253 5 20 25
6 blue 1901 0.3461677 6 20 30
# reduce to the first 10%, summarise and drop unwanted variables
out <- test %>%
  filter(pc <= 10) %>%  # select first 10% of observations by species and year
  summarise(mean_val = mean(value))
out
Source: local data frame [10 x 3]
Groups: species
species year mean_val
1 blue 1901 -0.99985643
2 blue 1902 0.08355729
3 blue 1903 0.67396796
4 blue 1904 0.14425229
5 blue 1905 -0.19426698
6 purple 1901 0.95767665
7 purple 1902 -0.40730494
8 purple 1903 0.10032964
9 purple 1904 0.36295224
10 purple 1905 1.30953008
If you then want the settings in which the first observation was detected, I think the best way to do that would be to do something like
setting <- test %>%
  group_by(species, year) %>%
  filter(row_number() == 1)
and then join that to the out data set.
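A minimal sketch of that join (assuming species and year identify the groups, with setting taken from the snippet above):
# attach the first-observation context columns to the group means
result <- left_join(out, setting, by = c("species", "year"))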