How to get a conditional proportion in a tibble in R

I have this tibble
# A tibble: 51,352 x 3
   host_id district availability_365
     <dbl> <chr>               <dbl>
 1    8573 Fatih                 280
 2    3725 Maltepe               365
 3    1428 Fatih                 355
 4    6284 Fatih                 164
 5    3518 Esenyurt                0
 6    8427 Esenyurt              153
 7    4218 Fatih                   0
 8    5342 Kartal                134
 9    4297 Pendik                  0
10    9340 Maltepe               243
# … with 51,342 more rows
I want to find out, per district, how high the proportion of hosts is that have all their rooms at availability_365 == 0. As you can see there are 51,352 rows, but not every row is a distinct host: there are exactly 37,572 different host_ids.
I know that I can use group_by(district) to split the data up into the 5 districts, but I am not quite sure how to find out what percentage of the hosts only have rooms with no availability. Can anybody help me out here?

Use summarise() along with group_by() in dplyr. (Note this gives the share of listings with zero availability per district; for the share of hosts whose listings are all at zero, group by host_id first, as in the data.table answer below.)
library(dplyr)

df %>%
  group_by(district) %>%
  summarise(Zero_Availability = sum(availability_365 == 0) / n())
# A tibble: 5 x 2
  district Zero_Availability
  <chr>                <dbl>
1 Esenyurt              0.5
2 Fatih                 0.25
3 Kartal                0
4 Maltepe               0
5 Pendik                1

It's difficult to be sure my answer works without the actual data, but if you're open to using data.table, the following should: first collapse to one row per host (flagging hosts whose listings are all at zero availability), then take the proportion per district.
library(data.table)

setDT(data)
data[, .(no_avail = all(availability_365 == 0)), .(host_id, district)][
  , .(prop_no_avail = sum(no_avail) / .N), .(district)]
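For completeness, the same host-level logic can be sketched in dplyr as well. This is a minimal sketch on hypothetical four-row toy data: all() flags hosts whose every listing has zero availability, and mean() of that logical gives the per-district proportion.

```r
library(dplyr)

# Hypothetical toy data in the shape of the question
df <- tibble::tribble(
  ~host_id, ~district,  ~availability_365,
      8573, "Fatih",    280,
      4218, "Fatih",      0,
      3518, "Esenyurt",   0,
      8427, "Esenyurt", 153
)

res <- df %>%
  group_by(host_id, district) %>%
  summarise(no_avail = all(availability_365 == 0), .groups = "drop") %>%
  group_by(district) %>%
  summarise(prop_no_avail = mean(no_avail))
```

Here one host in each district is fully unavailable, so both proportions come out as 0.5.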

Related

How do I change numeric values in a subset of columns in an R dataframe to other numeric values?

I have a dataset with currently 4 rows/subjects (more to come, as this is ongoing research) and 259 variables/columns. 240 variables of this dataset are ratings of fit ("How well does the following adjective match dimension X?") and 19 variables are sociodemographic.
For these 240 rating variables, my subjects could give a rating ranging from 1 ("fits very badly") to 7 ("fits very well"). Consequently, I have 240 variables with values from 1 to 7. I would like to change these numeric values as follows (the procedure being the same for all 240 columns):
1 should change to 0, 2 to 1/6, 3 to 2/6, 4 to 3/6, 5 to 4/6, 6 to 5/6, and 7 to 1. So no matter where in the 240 columns, a 1 should change to 0, and so on.
I have tried the following approaches:
Recode numeric values in R
In this post, it says that
x <- 1:10

# With recode function using backquotes as arguments
dplyr::recode(x, `2` = 20L, `4` = 40L)
# [1]  1 20  3 40  5  6  7  8  9 10

# With case_when function
dplyr::case_when(
  x %in% 2 ~ 20,
  x %in% 4 ~ 40,
  TRUE ~ as.numeric(x)
)
# [1]  1 20  3 40  5  6  7  8  9 10
Consequently, I tried this:
df = ds %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20)
%>% recode(.,`1`=0,`2`=-1/6,`3`=-2/6, `4`=3/6,`5`=4/6, `6`=5/6, `7`=1))
with AD01_01 etc. being the column names for the adjectives my subjects should rate. I also tried it without the ., after recode(, to no avail.
This code is flawed because it omits the 19 sociodemographic variables I want to keep in my dataset. Moreover, I get the error unexpected SPECIAL in "%>%".
I thought R might accept my selected columns via the pipe operator as the x in recode. Apparently, this is not the case. I also tried to read up on the R documentation for recode, but it made things much more confusing for me, as there were a lot of technical terms I don't understand.
As there is another option mentioned in the post, I also tried this:
df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (.,%in% 1~0,%in% 2~1/6,%in%3~2/6,%in%4~3/6,%in%5~4/6,%in%6~5/6,%in%7~1)
I thought I could give the output of the select function to the case_when function. Apparently, this is also not the case.
When I execute this command, I get
Error: unexpected SPECIAL in:
"df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (%in%"
Reading up on other possibilities, I found this
https://rstudio-education.github.io/hopr/modify.html
exemplary dataset:
head(dplyr::storms)
## # A tibble: 6 x 13
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
# We decide that we want to recode all NAs to 9999.
storm <- storms
storm$ts_diameter[is.na(storm$ts_diameter)] <- 9999
summary(storm$ts_diameter)
ds$AD01_01:AD01_20[1(ds$AD01_01:AD01_20)] <- 0, ds$AD01_01:AD01_20[2(ds$AD01_01:AD01_20)] <- 1/6, ds$AD01_01:AD01_20[3(ds$AD01_01:AD01_20)] <- 2/6,
ds$AD01_01:AD01_20[4(ds$AD01_01:AD01_20)] <- 3/6, ds$AD01_01:AD01_20[5(ds$AD01_01:AD01_20)] <- 4/6, ds$AD01_01:AD01_20[6(ds$AD01_01:AD01_20)] <- 5/6,
ds$AD01_01:AD01_20[7(ds$AD01_01:AD01_20)] <- 1
My idea in this case was to use assignment for multiple columns at a time (this attempt just concerns 20 of my 240 columns), and it also didn't work. I got the error
could not find function ":<-", which is weird because I thought : was a basic operator. The only noteworthy thing that might explain it is that I executed library(readr) and library(tidyverse) beforehand.
Disclaimer: I am an R newbie and have spent 2 hours trying to solve this issue. I would also like to know where I went wrong and why my code doesn't work.
How about using mutate(across())? For example, if all your "adjective rating" columns start with "AD", you can do something like this:
library(dplyr)
ds %>% mutate(across(starts_with("AD"), ~ (.x - 1) / 6))
Explanation of where you went wrong with your code:
First, your select(...) %>% recode(...) was close. However, when you use select, you reduce ds to only the selected columns, so recoding those values and assigning the result to df leaves df without the demographic variables.
Second, if you want to use recode you can, but you can't feed it an entire data frame/tibble, which is what happens when you pipe (%>%) the selected columns into it. Instead, apply recode() iteratively via the .fns argument of across(), on each of the columns matched by .cols, like this:
ds %>%
  mutate(across(
    .cols = starts_with("AD"),
    .fns = ~ recode(.x, `1` = 0, `2` = 1/6, `3` = 2/6, `4` = 3/6, `5` = 4/6, `6` = 5/6, `7` = 1)
  ))
(Note the `2` and `3` mappings must be 1/6 and 2/6; the minus signs in your attempt were typos.)
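To see that the (x - 1)/6 shortcut reproduces the desired mapping (1 → 0, 2 → 1/6, …, 7 → 1) while leaving other columns alone, here is a quick check on a hypothetical toy tibble standing in for the real AD columns:

```r
library(dplyr)

# Hypothetical toy data: two rating columns plus one demographic column
toy <- tibble(
  AD01_01 = c(1, 4, 7),
  AD01_02 = c(2, 7, 3),
  age     = c(25, 30, 28)
)

recoded <- toy %>% mutate(across(starts_with("AD"), ~ (.x - 1) / 6))
# AD01_01 becomes 0, 0.5, 1; AD01_02 becomes 1/6, 1, 2/6; age is untouched
```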

R dplyr summarise across remaining columns

Using R's dplyr is there a clean way to use the across function to select variables not already captured in other across statements?
For example, I could have a data set that I want to summarise on a 1-to-1 basis (i.e. the output columns have the same structure as the input data set), applying a first-value function to the character fields, mean to a select number of numeric fields, and then sum to all remaining fields.
Typically I have a large number of columns with various selection methods, so explicitly working out the remaining columns can be onerous. To date, I have found workarounds for the particular data sets I was working with, but a generic solution would be very useful.
The code below shows what I would like to run and the resulting output. remaining_columns() is made up.
library(dplyr)

first.if.unique = function(x) if (length(unique(x)) == 1) x[1] else NA
x = starwars %>% select(species, sex, mass, height) %>% head(10)
cols_to_average = c("mass")

x %>%
  group_by(sex) %>%
  summarise(
    across(where(is.character), first.if.unique),
    across(any_of(cols_to_average), mean),
    across(remaining_columns(), sum)
  )
Input / Output data sets:
> print(x)
# A tibble: 10 x 4
   species sex    mass height
   <chr>   <chr> <dbl>  <int>
 1 Human   male     77    172
 2 Droid   none     75    167
 3 Droid   none     32     96
 4 Human   male    136    202
 5 Human   female   49    150
 6 Human   male    120    178
 7 Human   female   75    165
 8 Droid   none     32     97
 9 Human   male     84    183
10 Human   male     77    182
> print(y)
# A tibble: 3 x 4
  sex    species  mass height
  <chr>  <chr>   <dbl>  <int>
1 female Human    62      315
2 male   Human    98.8    917
3 none   Droid    46.3    360
I personally don't see a way to achieve this apart from specifying the negative previous selections.
x %>%
  group_by(sex) %>%
  summarise(
    across(where(is.character), first.if.unique),
    across(any_of(cols_to_average), mean),
    across(!where(is.character) & !any_of(cols_to_average), sum)
  )
# A tibble: 3 x 4
  sex    species  mass height
  <chr>  <chr>   <dbl>  <int>
1 female Human    62      315
2 male   Human    98.8    917
3 none   Droid    46.3    360
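There is no built-in remaining_columns(), but one workaround is to compute the leftover names before the pipeline with setdiff() and feed them to all_of(). This is only a sketch: the selection logic has to mirror the earlier across() calls by hand.

```r
library(dplyr)

first.if.unique <- function(x) if (length(unique(x)) == 1) x[1] else NA
x <- starwars %>% select(species, sex, mass, height) %>% head(10)
cols_to_average <- c("mass")

# Names already consumed by the other across() calls: character columns
# (summarised by first.if.unique) and the columns to average
char_cols <- names(x)[sapply(x, is.character)]
leftover  <- setdiff(names(x), c(char_cols, cols_to_average))

y <- x %>%
  group_by(sex) %>%
  summarise(
    across(where(is.character), first.if.unique),
    across(any_of(cols_to_average), mean),
    across(all_of(leftover), sum)   # leftover is just "height" here
  )
```

The advantage over the negated selection is that leftover can be inspected (and unit-tested) before the summarise runs.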

Why doesn't rowSums work on a data frame created using pivot_wider in R?

I'm trying to work out the total volume remaining and the average volume for a large data set. I thought this would be a simple case of using rowSums and rowMeans on the data frame I created using pivot_wider, but I keep encountering the same errors.
library(tidyr)

df <- data.frame(
  parent = c("001", "001", "001", "001", "002", "002", "002", "002",
             "003", "003", "003", "003", "004", "004", "004", "004"),
  tube = rep(c("tube1", "tube2", "tube3", "tube4"), 4),
  microlitres = c(100, 120, 60, 100, NA, 200, 100, 120,
                  60, 100, 120, 40, 100, 120, 400, NA)
)

pivot_wider(df, names_from = tube, values_from = microlitres) -> df
df$sum <- rowSums(df, na.rm = TRUE)
I get the error 'x' must be numeric, and when I alter the code to
df$sum <- rowSums(as.numeric(df), na.rm = TRUE)
I get 'list' object cannot be coerced to type 'double'.
I've spent a long time googling and haven't come across anything that helps. I'm sure there's a simple fix but I just can't see it. I've tried using mutate with nested rowSums, I've tried unlist(), and converting it to a matrix. I'd be very grateful for any help and advice!
I hope the output is the one you had in mind:
library(dplyr)

df %>%
  rowwise() %>%
  mutate(sum_cols  = sum(c_across(tube1:tube4), na.rm = TRUE),
         mean_cols = mean(c_across(tube1:tube4), na.rm = TRUE))
# A tibble: 4 x 7
# Rowwise:
  parent tube1 tube2 tube3 tube4 sum_cols mean_cols
  <chr>  <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>
1 001      100   120    60   100      380       95
2 002       NA   200   100   120      420      140
3 003       60   100   120    40      320       80
4 004      100   120   400    NA      620      207.
This should work:
df$sum <- rowSums(df[-1], na.rm = TRUE)
The problem is that rowSums() needs an all-numeric input, and the parent column is character, so it has to be dropped before summing. (Coercing everything with sapply(df, as.numeric) would instead silently turn "001" into 1 and add it to the total.)
> df
# A tibble: 4 x 6
  parent tube1 tube2 tube3 tube4   sum
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 001      100   120    60   100   380
2 002       NA   200   100   120   420
3 003       60   100   120    40   320
4 004      100   120   400    NA   620
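Since the real obstacle is just the character parent column, both the total and the average can also be computed in base R on the numeric columns only. A sketch, with the wide table written out directly rather than pivoted:

```r
# The question's data in wide form, as pivot_wider() would produce it
df <- data.frame(
  parent = c("001", "002", "003", "004"),
  tube1  = c(100, NA, 60, 100),
  tube2  = c(120, 200, 100, 120),
  tube3  = c(60, 100, 120, 400),
  tube4  = c(100, 120, 40, NA)
)

tubes   <- df[sapply(df, is.numeric)]    # numeric columns only, parent dropped
df$sum  <- rowSums(tubes, na.rm = TRUE)
df$mean <- rowMeans(tubes, na.rm = TRUE)
```

Note rowMeans(..., na.rm = TRUE) divides by the number of non-NA tubes per row, which is usually what "average volume" means here.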

read.csv - to separate information stored in .csv based on the presence or absence of a duplicate value

First of all - apologies, I'm new to all of this, so I may write things in a confusing way.
I have multiple .csv files that I need to read, and to save a lot of time I am looking to find an automated way of doing this.
I am looking to read different rows of the .csv and store the information as two separate files, based on the information stored in the last column.
My data is specifically areas and slices of a 3D image, which I will use to compile volumes. If two rows have the same "slice" then I need to separate them, as the area in row 1 corresponds to a different structure from the one with an area in row 2 on the same slice.
Eg:
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183
So structure 1 has an area at slice 180 (area = 50) and slice 181 (area = 49), whereas structure 2 has an area at each slice from 180 to 183.
I want to be able to store the rows belonging to one structure in one .csv, and all the other rows in another .csv.
There may be .csv files with more or less overlapping slice values, adding complexity to this.
Thank you for the help, please let me know if I need to clarify anything.
Use duplicated:
dat <- read.csv(text = "
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183")

dat[duplicated(dat$slice), ]
#   Row area slice
# 2   2   52   180
# 4   4   53   181

dat[!duplicated(dat$slice), ]
#   Row area slice
# 1   1   50   180
# 3   3   49   181
# 5   5   65   182
# 6   6   60   183
(Whether you write each of these last two frames to files or store them for later use is up to you.)
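If you do want the two subsets written straight back out, a minimal sketch (the file names structure1.csv and structure2.csv are hypothetical):

```r
dat <- read.csv(text = "
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183")

# Hypothetical output file names; first occurrences of each slice go to
# one file, the duplicated occurrences to the other
write.csv(dat[!duplicated(dat$slice), ], "structure1.csv", row.names = FALSE)
write.csv(dat[duplicated(dat$slice), ],  "structure2.csv", row.names = FALSE)
```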
duplicated normally returns TRUE for the second and subsequent instances of the field(s). Your logic of rows 2, 4, 5, 6 is more along the lines of "last of the dupes" or "no dupes", which is a little different.
library(dplyr)

dat %>%
  group_by(slice) %>%
  slice(-n()) %>%
  ungroup()
# # A tibble: 2 x 3
#     Row  area slice
#   <int> <int> <int>
# 1     1    50   180
# 2     3    49   181

dat %>%
  group_by(slice) %>%
  slice(n()) %>%
  ungroup()
# # A tibble: 4 x 3
#     Row  area slice
#   <int> <int> <int>
# 1     2    52   180
# 2     4    53   181
# 3     5    65   182
# 4     6    60   183
Similarly, with data.table:
library(data.table)

as.data.table(dat)[, .SD[.N, ], by = .(slice)]
#    slice Row area
# 1:   180   2   52
# 2:   181   4   53
# 3:   182   5   65
# 4:   183   6   60

as.data.table(dat)[, .SD[-.N, ], by = .(slice)]
#    slice Row area
# 1:   180   1   50
# 2:   181   3   49

Find a function to return value based on condition using R

I have a table with values
KId  sales_month  quantity_sold
100            1              0
100            2              0
100            3              0
496            2              6
511            2             10
846            1              4
846            2              6
846            3              1
338            1              6
338            2              0
now i require output as
KId  sales_month  quantity_sold  result
100            1              0       1
100            2              0       1
100            3              0       1
496            2              6       1
511            2             10       1
846            1              4       1
846            2              6       1
846            3              1       0
338            1              6       1
338            2              0       1
Here, the calculation should go as follows: if the quantity sold in month 3 (March) is less than 60% of the combined quantity sold in months 1 and 2 (January and February), then the result should be 1; otherwise it should be 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I can suggest the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add year column and adjust group_by() arguments appropriately.
For more information on dplyr package, follow this link
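Putting the same idea together as a self-contained sketch: comparing quantity_sold directly against 0.6 times the sum of the two previous months (with lag() defaults supplied via coalesce()) avoids the 0/0 division entirely. Note this makes a choice the answer above leaves as NA: rows with no sales history get 0 here.

```r
library(dplyr)

# The question's table
data <- tibble(
  KId           = c(100, 100, 100, 496, 511, 846, 846, 846, 338, 338),
  sales_month   = c(1, 2, 3, 2, 2, 1, 2, 3, 1, 2),
  quantity_sold = c(0, 0, 0, 6, 10, 4, 6, 1, 6, 0)
)

resultData <- data %>%
  group_by(KId) %>%
  arrange(sales_month, .by_group = TRUE) %>%
  mutate(
    # Sum of the two previous months, treating missing history as 0
    prev2  = coalesce(lag(quantity_sold, 1), 0) + coalesce(lag(quantity_sold, 2), 0),
    # 1 when the current month falls below 60% of that sum, else 0
    result = ifelse(quantity_sold < 0.6 * prev2, 1, 0)
  ) %>%
  ungroup() %>%
  select(KId, sales_month, quantity_sold, result)
```

For KId 846, for example, month 3 sold 1 against a two-month history of 10, so 1 < 6 flags it with result 1, while months 1 and 2 get 0.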
