How do I change numeric values in a subset of columns in an R data frame to other numeric values?

I have a dataset with currently 4 rows/subjects (more to come, as this is ongoing research) and 259 variables/columns. 240 variables of this dataset are ratings of fit ("How well does the following adjective match the dimension X?") and 19 variables are sociodemographic.
For these 240 rating variables, my subjects could give a rating ranging from 1 ("fits very badly") to 7 ("fits very well"). Consequently, I have 240 variables with values from 1 to 7. I would like to change these numeric values as follows (the procedure being the same for all of the 240 columns):
1 should change to 0, 2 should change to 1/6, 3 should change to 2/6, 4 should change to 3/6, 5 should change to 4/6, 6 should change to 5/6 and 7 should change to 1. So no matter where in the 240 columns, a 1 should change to 0 and so on.
I have tried the following approaches:
Recode numeric values in R
In this post, it says that
x <- 1:10
# With recode function using backquotes as arguments
dplyr::recode(x, `2` = 20L, `4` = 40L)
# [1] 1 20 3 40 5 6 7 8 9 10
# With case_when function
dplyr::case_when(
x %in% 2 ~ 20,
x %in% 4 ~ 40,
TRUE ~ as.numeric(x)
)
# [1] 1 20 3 40 5 6 7 8 9 10
Consequently, I tried this:
df = ds %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20)
%>% recode(.,`1`=0,`2`=-1/6,`3`=-2/6, `4`=3/6,`5`=4/6, `6`=5/6, `7`=1))
with AD01_01 etc. being the column names for the adjectives my subjects should rate. I also tried it without the ., after recode(, to no avail.
This code is flawed because it omits the 19 columns of sociodemographic data I want to keep in my dataset. Moreover, I get the error unexpected SPECIAL in "%>%".
I thought R might accept my selected columns with the pipe operator as the "x" in recode. Apparently, this is not the case. I also tried to read up on the R documentation of recode but it made things much more confusing for me, as there were a lot of technical terms I don't understand.
As there is another option mentioned in the post, I also tried this:
df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (.,%in% 1~0,%in% 2~1/6,%in%3~2/6,%in%4~3/6,%in%5~4/6,%in%6~5/6,%in%7~1)
I thought I could give the output of the select function to the case_when function. Apparently, this is also not the case.
When I execute this command, I get
Error: unexpected SPECIAL in:
"df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (%in%"
Reading up on other possibilities, I found this
https://rstudio-education.github.io/hopr/modify.html
exemplary dataset:
head(dplyr::storms)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
# We decide that we want to recode all NAs to 9999.
storm <- storms
storm$ts_diameter[is.na(storm$ts_diameter)] <- 9999
summary(storm$ts_diameter)
ds$AD01_01:AD01_20[1(ds$AD01_01:AD01_20)] <- 0, ds$AD01_01:AD01_20[2(ds$AD01_01:AD01_20)] <- 1/6, ds$AD01_01:AD01_20[3(ds$AD01_01:AD01_20)] <- 2/6,
ds$AD01_01:AD01_20[4(ds$AD01_01:AD01_20)] <- 3/6, ds$AD01_01:AD01_20[5(ds$AD01_01:AD01_20)] <- 4/6, ds$AD01_01:AD01_20[6(ds$AD01_01:AD01_20)] <- 5/6,
ds$AD01_01:AD01_20[7(ds$AD01_01:AD01_20)] <- 1
My idea in this case was to use assignment for multiple columns at a time (this attempt covers just 20 of my 240 columns), and it also didn't work. I got the error
could not find function ":<-", which is weird because I thought this was a basic command. The only noteworthy thing that might explain it is that I executed library(readr) and library(tidyverse) beforehand.
Disclaimer: I am an R newbie and have spent 2 hours trying to solve this issue. I would also like to know where I went wrong and why my code doesn't work.

How about using mutate(across())? For example, if all your "adjective rating" columns start with "AD", you can do something like this:
library(dplyr)
ds %>% mutate(across(starts_with("AD"), ~(.x-1)/6))
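As a quick check of the formula, here is a minimal sketch on made-up data. The tibble and its values are hypothetical; only the AD01_01-style names follow the question. The second call shows one possible way to limit the selection to the column blocks listed in the question, which assumes the names follow the pattern ADxx_yy.

```r
library(dplyr)

# Hypothetical toy data: two rating columns (1 to 7) and one demographic column
ds <- tibble(
  AD01_01 = c(1, 4, 7),
  AD02_01 = c(2, 6, 3),
  age     = c(31, 45, 28)
)

# (x - 1) / 6 maps 1 -> 0, 2 -> 1/6, ..., 7 -> 1 in one step
rescaled <- ds %>%
  mutate(across(starts_with("AD"), ~ (.x - 1) / 6))

# Assumption: if other columns also start with "AD" (e.g. AD07_*, AD08_*),
# a regular expression can restrict the selection to blocks 01-06 and 09-14
rescaled_strict <- ds %>%
  mutate(across(matches("^AD(0[1-6]|09|1[0-4])_[0-9]+$"), ~ (.x - 1) / 6))
```

The demographic column is left untouched because mutate() keeps every column it is not asked to change.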
Explanation of where you went wrong with your code:
First, your select(...) %>% recode(...) was close. However, when you use select, you are reducing ds to only the selected columns, thus recoding those values and assigning to df will result in df not having the demographic variables.
Second, if you want to use recode you can, but you can't feed it an entire data frame/tibble, like you are doing when you pipe (%>%) the selected columns to it. Instead, you can use recode() iteratively in .fns, on each of the columns in the .cols param of across(), like this:
ds %>%
  mutate(across(
    .cols = starts_with("AD"),
    .fns = ~recode(.x, `1` = 0, `2` = 1/6, `3` = 2/6, `4` = 3/6, `5` = 4/6, `6` = 5/6, `7` = 1)
  ))
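The two error messages from the question have simpler causes. A line that begins with %>% fails with unexpected SPECIAL because R has already finished parsing the previous line, so the pipe has to sit at the end of the line it continues. And ds$AD01_01:AD01_20[...] <- 0 fails because $ extracts a single column, while a range such as AD01_01:AD01_20 only has a meaning inside selection helpers like select(); R ends up looking for an assignment form of the : operator, which does not exist. A base R sketch of the same rescaling, on hypothetical toy data, could look like this:

```r
# Hypothetical toy data: two rating columns and one demographic column
ds <- data.frame(
  AD01_01 = c(1, 4, 7),
  AD01_02 = c(2, 6, 3),
  age     = c(31, 45, 28)
)

# Names of the rating columns, selected by their prefix
rating_cols <- grep("^AD", names(ds), value = TRUE)

# Arithmetic on a data frame is applied to every selected cell at once
ds[rating_cols] <- (ds[rating_cols] - 1) / 6
```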

Related

How to get a conditional proportion in a tibble in r

I have this tibble
host_id district availability_365
<dbl> <chr> <dbl>
1 8573 Fatih 280
2 3725 Maltepe 365
3 1428 Fatih 355
4 6284 Fatih 164
5 3518 Esenyurt 0
6 8427 Esenyurt 153
7 4218 Fatih 0
8 5342 Kartal 134
9 4297 Pendik 0
10 9340 Maltepe 243
# … with 51,342 more rows
I want to find out, per district, the proportion of hosts that have all of their rooms at availability_365 == 0. As you can see, there are 51,352 rows, but not every row belongs to a different host. There are exactly 37,572 different host_ids.
I know that I can use group_by(district) to split the data into the 5 different districts, but I am not sure how to find out what percentage of the hosts only have rooms with no availability. Can anybody help me out here?
Use summarise() function along with group_by() in dplyr.
library(dplyr)
df %>%
group_by(district) %>%
summarise(Zero_Availability = sum(availability_365==0)/n())
# A tibble: 5 x 2
district Zero_Availability
<chr> <dbl>
1 Esenyurt 0.5
2 Fatih 0.25
3 Kartal 0
4 Maltepe 0
5 Pendik 1
It's difficult to make sure my answer is working without actually having the data, but if you're open to using data.table, the following should work
library(data.table)
setDT(data)
data[, .(no_avail = all(availability_365 == 0)), .(host_id, district)][, .(
prop_no_avail = sum(no_avail) / .N
), .(district)]
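Note that the question asks about hosts rather than listings, so each host first has to be reduced to a single TRUE/FALSE flag before the district proportion is taken. A dplyr sketch of that two-step idea, on hypothetical toy data (the host ids and values below are made up):

```r
library(dplyr)

# Hypothetical toy data: hosts 1 and 3 have more than one listing
df <- tibble(
  host_id          = c(1, 1, 2, 3, 3, 4, 5),
  district         = c("Fatih", "Fatih", "Fatih", "Kartal", "Kartal", "Kartal", "Kartal"),
  availability_365 = c(0, 0, 10, 0, 5, 0, 0)
)

# Step 1: one row per host, TRUE if every listing of that host has availability 0
host_level <- df %>%
  group_by(district, host_id) %>%
  summarise(no_avail = all(availability_365 == 0), .groups = "drop")

# Step 2: share of such hosts within each district
result <- host_level %>%
  group_by(district) %>%
  summarise(prop_no_avail = mean(no_avail))
```

In this toy data, one of two hosts in Fatih and two of three hosts in Kartal have no availability at all.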

Double loop to fill dataframe - how to fix "invalid function in complex assignment"

I have a dataframe:
results2 (612 obs. 281 variables)
ID Q1000_p2000_2016 Q1893_p2039_2016 .... Q1000_p2000_2017 Q1893_p2039_2017
1 392 381 422 351
2 432 293 398 310
. . . . .
. . . . .
where there are 140 questions from 2016 and 140 from 2017, each year the questions share the same name but each variable name has "_2016" or "_2017" at the end to discriminate between time periods.
and another dataframe:
absdiff (0 obs. 141 variables)
ID Q1000_p2000 Q1893_p2039 ....
I want to assign a value in absdiff by taking the absolute difference of the two years, for each question for each ID.
In my condition, I check the question number for 2016 (or the first few characters of the variable name) matches the question number for 2017 in results2.
If that holds, I want to assign the absolute difference of the two answers to the corresponding variable/question number in absdiff
I have used
for (q in 2:141){
if (substr(colnames(results2[q]),1,12) == substr(colnames(results2[q+140]),1,12)){
for (j in 1:nrow(results2)){absdiff$substr(colnames(results2[q]),1,11) <- abs(results2[j,q] - results2[j,(q+140)])}
}
else
print("ERROR")
}
but I get this error message:
Error in absdiff$substr(colnames(results2[q]), 1, 11) <- abs(results2[j, :
invalid function in complex assignment
What problem causes this error message? How do I fix it?
For replication sake this can all be simplified to:
ID <- c(1,2)
Q1000_p2000_2016 <- c(392,432)
Q1893_p2039_2016 <- c(381,293)
Q1000_p2000_2017 <- c(422,398)
Q1893_p2039_2017 <- c(351,310)
results2 <- as.data.frame(cbind(ID, Q1000_p2000_2016, Q1893_p2039_2016 ,Q1000_p2000_2017, Q1893_p2039_2017 ))
absdiff <- results2[FALSE,1:3]
for (q in 2:3){
if (substr(colnames(results2[q]),1,12) == substr(colnames(results2[q+2]),1,12)){
for (j in 1:nrow(results2)){absdiff$substr(colnames(results2[q]),1,11) <- abs(results2[j,q] - results2[j,(q+2)])}
}
else
print("ERROR")
}
Don't use loops; just vectorize. Get the 2016 columns and the 2017 columns, subtract them, and take the absolute value:
col2016 <- grep("_2016$", names(results2), value = TRUE)
col2017 <- grep("_2017$", names(results2), value = TRUE)
# assumes both sets of columns list the questions in the same order
absdiff <- abs(results2[, col2017] - results2[, col2016])
#   Q1000_p2000_2017 Q1893_p2039_2017
# 1               30               30
# 2               34               17
To retain the ID column, just add it after:
absdiff$ID<-results2$ID
Quick notes on your code for future reference: the cause of the error is absdiff$substr(colnames(results2[q]),1,11). The dollar sign expects a literal column name, so you can't put a function call after it even if that call returns a string. You can, however, use brackets, like this: absdiff[substr(colnames(results2[q]),1,11)].
Another problem is that absdiff is initially empty. results2[FALSE,1:3] gives you the column names but no rows (remove FALSE if you want all the rows), which means you won't be able to assign values to the new column.
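For completeness, here is a sketch of how the original loop could be repaired with these two points in mind. It uses [[ ]] with a computed name instead of $, starts absdiff with one row per ID, and drops the inner loop over rows because subtraction is already vectorised. The names n_q and stem are introduced here for illustration only.

```r
# Sample data from the question
results2 <- data.frame(
  ID               = c(1, 2),
  Q1000_p2000_2016 = c(392, 432),
  Q1893_p2039_2016 = c(381, 293),
  Q1000_p2000_2017 = c(422, 398),
  Q1893_p2039_2017 = c(351, 310)
)

n_q <- 2                                  # questions per year (140 in the real data)
absdiff <- data.frame(ID = results2$ID)   # one row per ID, so columns can be added

for (q in 2:(n_q + 1)) {
  name_2016 <- colnames(results2)[q]
  name_2017 <- colnames(results2)[q + n_q]
  stem <- substr(name_2016, 1, 11)        # e.g. "Q1000_p2000"
  if (stem == substr(name_2017, 1, 11)) {
    # [[ ]] accepts a computed column name; $ does not
    absdiff[[stem]] <- abs(results2[[q]] - results2[[q + n_q]])
  } else {
    warning("Column names do not match at position ", q)
  }
}
```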
Finally, if you expect to do this kind of thing more often, I recommend taking a look at Tidy Data and the different methods for reshaping data to make analysis easier and more intuitive. As an example, with your sample data you could do something like this:
library(dplyr)
library(reshape2)
new_resutls <- results2 %>%
reshape2::melt(id.vars='ID') %>%
dplyr::mutate(question = substr(variable, 1, 11),
year = substr(variable, 13, 16))
new_resutls
# ID variable value question year
# 1 1 Q1000_p2000_2016 392 Q1000_p2000 2016
# 2 2 Q1000_p2000_2016 432 Q1000_p2000 2016
# 3 1 Q1893_p2039_2016 381 Q1893_p2039 2016
# 4 2 Q1893_p2039_2016 293 Q1893_p2039 2016
# 5 1 Q1000_p2000_2017 422 Q1000_p2000 2017
# 6 2 Q1000_p2000_2017 398 Q1000_p2000 2017
# 7 1 Q1893_p2039_2017 351 Q1893_p2039 2017
# 8 2 Q1893_p2039_2017 310 Q1893_p2039 2017
Your problem can then be solved like this:
new_resutls %>%
dplyr::group_by(ID, question) %>%
dplyr::summarise(absdiff = abs(sum(value*c(1, -1))))
# ID question absdiff
# <dbl> <chr> <dbl>
# 1 1 Q1000_p2000 30
# 2 1 Q1893_p2039 30
# 3 2 Q1000_p2000 34
# 4 2 Q1893_p2039 17

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe recording bilateral trade data between 161 countries. The data are in dyadic format and contain 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid and pid take values from 1 to 161, and a country is assigned the same rid and pid. For any given pair (rid, pid) with rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database. Each rid is paired with multiple pid values to get their bilateral trade data, but not every pid has a numeric id, because I only assigned a rid or pid to a country if a list of relevant economic indicators is available for that country. That is why there are NAs in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies when a country is a "reporter": if it did not report any TradeValue with partners, its id number is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e., Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which
for those countries that are absent from the rid column (e.g., rid == 1), a row is created and the entire row (in the 161 x 161 matrix) is set to 0;
for those countries (pid) that do not share TradeValue entries with a particular rid, the corresponding cells are set to 0.
For example, suppose that in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, while the other four reported their bilateral trade statistics with each other (but not with country 1). The original dataframe looks like
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
From this I want to build a 5 x 5 adjacency matrix (as a data.frame). The desired output should look like this
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
I then want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not manage this conversion, not even the first step.
I would really appreciate it if anyone could enlighten me on this.
I cannot read the Dropbox file, but I have tried to work off your 5-country example dataframe:
library(reshape2)  # provides dcast()
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = setdiff(1:country_num, example_data$pid)
# create dummy dataframe in which every missing rid and pid occurs once
# (paired with country 1, TradeValue set to NA)
n_r = length(rid_miss)
n_p = length(pid_miss)
add_data = unique(data.frame(rid = c(rid_miss, rep(1, n_p)),
                             pid = c(rep(1, n_r), pid_miss),
                             TradeValue = rep(NA_real_, n_r + n_p)))
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting rownames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
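An alternative that avoids reshaping altogether is to start from an all-zero matrix and fill it by matrix indexing. This is a minimal sketch on the 5-country example from the question; the names n, adj, and adj_df are introduced here for illustration. Rows with a missing rid or pid are skipped, which is an assumption about how the NA partners should be treated.

```r
# The 5-country example from the question
example_data <- data.frame(
  rid        = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  pid        = c(3, 4, 5, 2, 4, 5, 2, 3, 5, 2, 3, 4),
  TradeValue = c(223, 13, 9, 223, 57, 28, 13, 57, 82, 9, 28, 82)
)

n <- 5                                   # 161 for the full data
adj <- matrix(0, nrow = n, ncol = n)     # every cell starts at 0

# Keep only rows where both ids are known
ok <- !is.na(example_data$rid) & !is.na(example_data$pid)

# A two-column index matrix addresses one (row, column) cell per data row
adj[cbind(example_data$rid[ok], example_data$pid[ok])] <- example_data$TradeValue[ok]

adj_df <- as.data.frame(adj)             # columns are named V1 ... V5
```

Countries that never appear as a reporter keep their all-zero row, and pairs without an entry stay at 0, so no dummy rows are needed.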

R loop through multiple sub groups with using functions

Hi I am trying to learn how to loop through multiple groups within a data frame and apply certain arithmetic operations. I do not have a programming background and am struggling to loop through the multiple conditions.
My data looks like the following:
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
What I would like to do is to look through each event and each category and take the average value of Money in cases where Play_together == 1. When Play_together==0, then I would like to apply Money/100.
I understand that the loop would look something like the following:
for i in 1:nrow(z){
#loop for event{
#loop for Category{
#Define avg or division function
}
}
}
However, I cannot seem to implement this using a nested loop. I saw another post (link: apply function for each subgroup) which uses dplyr package. I was wondering if someone could help me to implement this without using any packages (I know this might take longer as compared to using R packages). I am trying to learn R and this is the first time I am working with nested loops.
The final output will look like this:
where for event 1, the following holds:
a) For category 1:
Play_together ==1 in row 1; we take the avg of Money value and hence final output = 23/1= 23
Play_together==0 in row 2; we take Money/100= 0.11
b) For category 2:
Play_together == 1 for all observations. We take avg Money for all three observations.
This holds similarly for Event 2. In my actual dataset, I have event = 600 and number of category ranging from 1 - 10. Some events may have only 1 category and a maximum of 10 categories. So any function needs to be extremely flexible. The total number of observations in my dataset is around 1.5 million so any changes in the looping process to reduce the time taken to carry out the operation is going to be extremely helpful (Although at this stage my priority is the looping process itself).
Once again it would be a great help if you can show me how to use nested looping and explain the steps in brief. Much appreciated.
Will something like this do?
I know it's using dplyr, but that package is made for this kind of job ;-)
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
library(dplyr)
df_temp <- z %>%
group_by( Event, Category, Play_together ) %>%
summarise( money_mean = mean( Money ) ) %>%
mutate( final_output = ifelse( Play_together == 0, money_mean / 100, money_mean )) %>%
select( -money_mean )
df <- z %>%
left_join(df_temp, by = c("Event", "Category", "Play_together" )) %>%
arrange(Event, Category)
Consider base R's by, the object-oriented wrapper to tapply designed to subset dataframes by factor(s) but unlike split can pass subsets into a defined function. Then, run conditional logic with ifelse for Final_Output field. Finally, stack all subsetted dataframes for final object.
# LIST OF DATAFRAMES
by_list <- by(z, z[c("Event", "Category")], function(sub) {
tmp <- subset(sub, Play_together==1)
sub$Final_Output <- ifelse(sub$Play_together == 1, mean(tmp$Money), sub$Money/100)
return(sub)
})
# APPEND ALL DATAFRAMES
final_df <- do.call(rbind, by_list)
row.names(final_df) <- NULL
final_df
# Event Indiv1 Indiv2 Category Play_together Money Final_Output
# 1 1 4 7 1 1 23 23.00
# 2 1 5 81 1 0 11 0.11
# 3 2 66 34 1 1 345 217.50
# 4 2 32 90 1 0 21 0.21
# 5 2 45 12 1 1 90 217.50
# 6 1 6 91 2 1 78 19.00
# 7 1 11 67 2 1 -9 19.00
# 8 1 45 12 2 1 -12 19.00
# 9 2 8 56 2 1 9 26.00
# 10 2 9 78 2 1 43 26.00
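Since the question asks specifically for nested loops without packages, here is a base R sketch of that approach. The outer loop walks over events, the inner loop over the categories that occur in that event, and logical vectors pick out the rows of each group. The helper names in_group, played, and not_played are introduced here for illustration.

```r
z <- data.frame(
  Event         = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Indiv1        = c(4, 5, 6, 11, 45, 66, 8, 9, 32, 45),
  Indiv2        = c(7, 81, 91, 67, 12, 34, 56, 78, 90, 12),
  Category      = c(1, 1, 2, 2, 2, 1, 2, 2, 1, 1),
  Play_together = c(1, 0, 1, 1, 1, 1, 1, 1, 0, 1),
  Money         = c(23, 11, 78, -9, -12, 345, 9, 43, 21, 90)
)

z$Final_Output <- NA_real_

for (ev in unique(z$Event)) {
  # only the categories that actually occur in this event
  for (ct in unique(z$Category[z$Event == ev])) {
    in_group   <- z$Event == ev & z$Category == ct
    played     <- in_group & z$Play_together == 1
    not_played <- in_group & z$Play_together == 0
    # average Money over the rows that played together
    z$Final_Output[played] <- mean(z$Money[played])
    # Money / 100, row by row, for the rows that did not
    z$Final_Output[not_played] <- z$Money[not_played] / 100
  }
}
```

The loops run once per event and category combination rather than once per row, and the rows stay in their original order.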

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this but more lab-values variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
For each patient (with a unique ID), I want to calculate the decrease, or slope per day, between the first and last measurement, in order to compare the slopes between patients. Time is in days. So eventually I want a new variable for each lab value, e.g. diff_labvalues, that gives me for labvalue1:
For patient 1: (2.93-4.04)/(7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last minus first); likewise for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions, because first() and last() did not work with the NA values.
Thereafter, I grouped by 'id' and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for that patient.
But I cannot find out how to extract the corresponding time values (the delta time), which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try the following for a single lab value, here labvalue1:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(!is.na(labvalue1)) %>%
  summarise(diff_labs = (last(labvalue1) - first(labvalue1))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labvalue1), .(diff_labs = (labvalue1[.N] - labvalue1[1])/(time[.N] - time[1])), id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
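Both snippets above handle one lab value at a time. Because the original data has more lab columns, each with NAs in different rows, one option is a small helper that drops the NAs per column together with the matching time values, applied with across(). This is a sketch only: slope_first_last is a hypothetical helper name, and the selection assumes that all lab columns start with "labvalue".

```r
library(dplyr)

df <- data.frame(
  id        = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  time      = c(0, 3, 7, 35, 0, 7, 14, 28, 42),
  labvalue1 = c(4.04, NA, 2.93, NA, NA, 3.78, 3.66, NA, 2.54),
  labvalue2 = c(NA, 63.8, 62.8, 61.2, 78.1, NA, 77.6, 75.3, NA)
)

# Hypothetical helper: slope between the first and last non-missing measurement
slope_first_last <- function(value, time) {
  keep <- !is.na(value)
  v  <- value[keep]
  tm <- time[keep]          # time points that belong to the kept values
  n  <- length(v)
  if (n < 2) return(NA_real_)
  (v[n] - v[1]) / (tm[n] - tm[1])
}

slopes <- df %>%
  group_by(id) %>%
  summarise(across(
    starts_with("labvalue"),
    ~ slope_first_last(.x, time),   # time is the current group's time column
    .names = "diff_{.col}"
  ))
```

The result has one row per patient and one diff_ column per lab value; groups with fewer than two measurements get NA.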
