Checking if columns in a dataframe are "paired" in R

I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A      B
1      5.5
1      5.5
2      201
9      18
9      18
2      201
9      18
...    ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?

You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values of B within each group:
library(dplyr)

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)
df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        1
2     2        1
3     9        1
If we now alter df so that the pairing no longer holds:
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)
df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        2
2     2        1
3     9        1
Observe the increased count for group 1. Since you have more than 10,000 rows, what remains is to check whether any group has n_unique > 1, for instance with filter(n_unique > 1).
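Putting the pieces together, a minimal sketch of the full check (using n_distinct(), which is equivalent to distinct() followed by n()): zero rows surviving the filter means the columns are perfectly paired.

```r
library(dplyr)

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)  # the altered, unpaired version
)

# any row surviving the filter marks a value of A with more than one B
violations <- df %>%
  group_by(A) %>%
  summarize(n_unique = n_distinct(B)) %>%
  filter(n_unique > 1)

nrow(violations) == 0  # FALSE here, because A = 1 maps to both 5.5 and 5.4
```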

If you run this you will see how many unique values of B there are for each value of A (assuming your data frame is called df):
tapply(df$B, df$A, function(x) length(unique(x)))
So if the max of this vector is 1, then no value of A has more than one corresponding value of B.
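Reusing the sample data from the answer above, the final check is a one-liner; df here stands in for your real data frame:

```r
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)

# number of distinct B values per value of A
counts <- tapply(df$B, df$A, function(x) length(unique(x)))
counts
# 1 2 9
# 1 1 1

max(counts) == 1  # TRUE: the columns are paired
```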

Related

R - Efficiently calculate the number of columns meeting a condition for each observation in a dataset

I am trying to find a tidyverse-based programmatic approach to calculating the number of variables meeting a condition in a dataset. The data contains a row for each individual and a variable containing a code that describes something about that individual. I need to efficiently create a tally of how many times that variable code meets multiple sets of criteria. My current process uses dplyr's mutate along with row-wise summing within a tidyverse pipeline to create the required tally.
Other similar posts answer this question by summing row-wise, as I already do. In practice, this approach results in an extensive amount of code and slow processing, since there are five variables, thousands of individuals, and a dozen criteria to tally separately.
Here is a demonstration of what I've tried so far. The desired output here is calculated as if the condition were for the code in the variables to match 20 or 24.
## Sample data and result
library(dplyr)

sample <- tibble(
  subjectNum = 1:10,
  var1 = c(20, 24, 20, 1, 24, 27, 7, 21, 20, 3),
  var2 = c(24, 20, 7, 19, 12, 8, 8, 10, 22, NA),
  var3 = c(NA, NA, 24, 20, NA, 20, 9, 3, 24, NA),
  desired_output = c(2, 2, 2, 1, 1, 1, 0, 0, 2, 0)
)

sample_calc <- sample %>%
  rowwise() %>%
  mutate(output = sum(var1 %in% c(20, 24), var2 %in% c(20, 24), var3 %in% c(20, 24), na.rm = TRUE))

all(sample_calc$output == sample_calc$desired_output) # should return TRUE
The actual analysis requires conducting such a test for multiple sets of criteria that are available in a separate data file. It also requires the data structure to generally be maintained, so solutions using pivot_longer to count the variables fail as well.
We may use the vectorized rowSums, looping across the columns whose names start with 'var', creating the logical condition within the loop and applying rowSums to the resulting logical columns. It should be more efficient than a rowwise sum.
library(dplyr)
sample %>%
  mutate(output = rowSums(across(starts_with('var'),
                                 ~ .x %in% c(20, 24)), na.rm = TRUE))
-output
# A tibble: 10 × 6
   subjectNum  var1  var2  var3 desired_output output
        <int> <dbl> <dbl> <dbl>          <dbl>  <dbl>
 1          1    20    24    NA              2      2
 2          2    24    20    NA              2      2
 3          3    20     7    24              2      2
 4          4     1    19    20              1      1
 5          5    24    12    NA              1      1
 6          6    27     8    20              1      1
 7          7     7     8     9              0      0
 8          8    21    10     3              0      0
 9          9    20    22    24              2      2
10         10     3    NA    NA              0      0
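The question mentions multiple criteria sets stored in a separate file; the same rowSums idea extends to that case. Here is a sketch under the assumption that the sets have been read into a named list (criteria, crit_a, and crit_b are made-up names, not from the original post):

```r
library(dplyr)

sample <- tibble(
  subjectNum = 1:10,
  var1 = c(20, 24, 20, 1, 24, 27, 7, 21, 20, 3),
  var2 = c(24, 20, 7, 19, 12, 8, 8, 10, 22, NA),
  var3 = c(NA, NA, 24, 20, NA, 20, 9, 3, 24, NA)
)

# hypothetical criteria sets; in practice these would come from the separate data file
criteria <- list(crit_a = c(20, 24), crit_b = c(7, 8))

vars <- c("var1", "var2", "var3")
result <- sample
for (nm in names(criteria)) {
  # %in% is NA-safe: NA %in% c(20, 24) is FALSE, so no na.rm is needed
  result[[nm]] <- rowSums(sapply(sample[vars], `%in%`, criteria[[nm]]))
}
result$crit_a
# [1] 2 2 2 1 1 1 0 0 2 0
```

This adds one tally column per criteria set while leaving the original data structure intact, as the question requires.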

slice lowest positive value in R

I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 2, 3), days=c(100, 10, -8, -5, 12, 10))
Now I only want to take the lowest positive value of "days" so that my output would look like this:
new <- data.frame(ID=c(1, 2, 3), days=c(10, 12, 10))
I have thought about this:
df %>%
  group_by(ID) %>%
  slice_min(days)
But of course this will return the lowest number even if it is negative. What can I do to only get the lowest positive values?
Preferably using dplyr.
Thanks so much!
Filtering only positive values for days should do it.
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3), days = c(100, 10, -8, -5, 12, 10))
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(days > 0) %>%
  slice_min(days)
#> # A tibble: 3 x 2
#> # Groups:   ID [3]
#>      ID  days
#>   <dbl> <dbl>
#> 1     1    10
#> 2     2    12
#> 3     3    10
You can use aggregate()
aggregate(days ~ ID, df, function(x){
  min(x[x > 0])
})
#   ID days
# 1  1   10
# 2  2   12
# 3  3   10
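One caveat for both answers: an ID whose days are all negative is silently dropped by the filter() version, and min() of an empty vector returns Inf (with a warning) in the aggregate() version. If every ID must appear in the output, a small guard helps; ID 4 below is added purely to illustrate:

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 4),
                 days = c(100, 10, -8, -5, 12, 10, -3))

res <- aggregate(days ~ ID, df, function(x) {
  pos <- x[x > 0]
  if (length(pos) == 0) NA else min(pos)  # NA instead of Inf for all-negative groups
})
res
#   ID days
# 1  1   10
# 2  2   12
# 3  3   10
# 4  4   NA
```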

How to sum values from multiple rows of a variable linked by household ID

I have the following tibble (but in reality with many more rows): it is called education_tibble
library(tidyverse)
education_tibble <- tibble(
ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
1011016, 1011021, 1011017, 1011019),
hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9))
   ghousecode hhc_educ
        <dbl>    <dbl>
 1    1011027        2
 2    1011017        0
 3    1011021       11
 4    1011019       10
 5    1011025       14
 6    1011017        4
 7    1011016        8
 8    1011021       16
 9    1011017        0
10    1011019        9
I am trying to sum the hhc_educ so that each ghousecode has a corresponding "total hhc_educ". I am struggling to do this, and not sure what to do. I have been using the tidyverse, so have been exploring ways mostly within dplyr. Here is my code:
education_tibble %>%
group_by(ghousecode, add = TRUE)
summarize(total_educ = sum(hhc_educ))
The problem is that this code generates just one value for some reason, not a total_educ value for each group. Essentially I'm after a new tibble ultimately which will have each ghousecode in one row with the sum of all of the hhc_educ values next to it. Any help would be much appreciated! Thank you!
You missed a %>% I think.
library(tidyverse)

# data
education_tibble <- tibble(
  ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
                 1011016, 1011021, 1011017, 1011019),
  hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9))

# grouped sum
education_tibble %>%
  group_by(ghousecode, add = TRUE) %>%
  summarise(total_educ = sum(hhc_educ))
Produces:
# A tibble: 6 x 2
  ghousecode total_educ
       <dbl>      <dbl>
1    1011016          8
2    1011017          4
3    1011019         19
4    1011021         27
5    1011025         14
6    1011027          2
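For reference, the same grouped sum can be done without dplyr, using base R's aggregate():

```r
library(tibble)

education_tibble <- tibble(
  ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
                 1011016, 1011021, 1011017, 1011019),
  hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9))

# one row per ghousecode, with the summed hhc_educ next to it
totals <- aggregate(hhc_educ ~ ghousecode, education_tibble, sum)
totals
#   ghousecode hhc_educ
# 1    1011016        8
# 2    1011017        4
# 3    1011019       19
# 4    1011021       27
# 5    1011025       14
# 6    1011027        2
```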

Filter and select a dataset based on a value in a row

I have looked into dplyr and tidyr and even base R but I cannot seem to figure out how to subset my data based on a row value.
I have tried using dplyr filter() and select() functions but because gender, language, and age are in the id column, I cannot filter by just typing data %>% filter(gender == 1).
I have a list of 50 raters. For the example here I will display 5. I have 183 rows, which include the raters answers to each question and the three last rows have demographic data, such as age, gender and whether someone is a native or non-native speaker. I will illustrate here with 6 rows as an example.
What I am trying to do is find a way to subset my data according to the values in the age, gender, and language values. Let's say I want to select all the ratings for gender 1, or for language 1, or for gender 1 AND language 1.
Thank you.
Code:
data <- data.frame("id" = c(901, 902, 903, "age", "gender", "language"),
                   "rater1" = c(7, 9, 9, 21, 1, 1),
                   "rater2" = c(9, 9, 9, 39, 2, 2),
                   "rater3" = c(9, 9, 9, 38, 2, 1),
                   "rater4" = c(9, 9, 9, 33, 2, 1),
                   "rater5" = c(2, 9, 9, 21, 2, 1))
In order to filter by gender and the other variables of interest we will need to rearrange the data so that they are columns and not rows within a column. One way we can do that is to use gather and then spread. After changing the structure you can utilize dplyr filtering.
library(tidyr)
library(dplyr)

data <- data %>%
  gather(key = "Rater", value = "Value", rater1:rater5) %>%
  spread(id, Value) %>%
  filter(gender == 1)
Well, I am not sure whether this scales well for your use case but you could do basic indexing:
# data
x <- data.frame("id" = c(901, 902, 903, "age", "gender", "language"),
                "rater1" = c(7, 9, 9, 21, 1, 1),
                "rater2" = c(9, 9, 9, 39, 2, 2),
                "rater3" = c(9, 9, 9, 38, 2, 1),
                "rater4" = c(9, 9, 9, 33, 2, 1),
                "rater5" = c(2, 9, 9, 21, 2, 1))
# ensure id is character and not factor
x$id <- as.character(x$id)
# select all raters whose gender or language is 1
x[, c(TRUE, x[x$id == "gender", -1] == 1) |
     c(TRUE, x[x$id == "language", -1] == 1)]
The TRUE ensures that the id column is kept in any case and the -1 ensures that the logical vector has the desired length (number of columns).
I'd suggest working with two data frames, one (I call demo) for the demographic information on raters, 1 row per rater, and one (I call ratings) for the ratings each rater gave, 1 row per response:
library(tidyr)
library(dplyr)
demo = tail(data, 3)
ratings = head(data, -3)
demo_cols = demo$id
demo = data.frame(t(demo[-1]))
names(demo) = demo_cols
demo$rater = as.numeric(sub(pattern = "rater", replacement = "", rownames(demo)))
demo
#        age gender language rater
# rater1  21      1        1     1
# rater2  39      2        2     2
# rater3  38      2        1     3
# rater4  33      2        1     4
# rater5  21      2        1     5
ratings = tidyr::pivot_longer(ratings, cols = starts_with("rater"),
                              names_to = "rater", names_prefix = "rater") %>%
  mutate(rater = as.numeric(rater))
ratings
# # A tibble: 15 x 3
#    id    rater value
#    <fct> <dbl> <dbl>
#  1 901       1     7
#  2 901       2     9
#  3 901       3     9
#  4 901       4     9
#  5 901       5     2
#  6 902       1     9
# ...
Then, when you want to do something like "select all the ratings for gender 1, or for language 1, or for gender 1 AND language 1", you do a simple filter of demo, and join to the ratings data to get the matching records:
demo %>% filter(gender == 1 & language == 1) %>%
inner_join(ratings)
# Joining, by = "rater"
#   age gender language rater  id value
# 1  21      1        1     1 901     7
# 2  21      1        1     1 902     9
# 3  21      1        1     1 903     9
You could also do the complete join:
ratings_with_demo = inner_join(ratings, demo)
and filter that data frame directly. But remember, if you do this, that each row is a response. If you want to do something like count the number of raters by gender, the demo data frame is a much nicer starting place.
Just turn it on its side. Make sure to turn id into row names first, and then remove id to prevent type coercion. t also returns a matrix, so you'll need to turn the data back into a data frame with as_tibble or as.data.frame:
library(dplyr)
data <- as_tibble(t(`rownames<-`(data, data$id)[-1]))
Now filter should do what you expect:
data %>% filter(gender == 1)
#### OUTPUT ####
# A tibble: 1 x 6
  `901` `902` `903`   age gender language
  <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
1     7     9     9    21      1        1

From dataframe with values per min max to value per key

I have a dataframe with values defined per bucket. (See df1 below)
Now I have another dataframe with values within those buckets for which I want to look up a value from the bucketed dataframe (See df2 below)
Now I would like to have the result df3 below.
df1 <- data.frame(MIN = c(1, 4, 8), MAX = c(3, 6, 10), VALUE = c(3, 56, 8))
df2 <- data.frame(KEY = c(2, 5, 9))
df3 <- data.frame(KEY = c(2, 5, 9), VALUE = c(3, 56, 8))
> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
> df2
  KEY
1   2
2   5
3   9
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     8
EDIT:
Extended the example.
> df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
> df2 <- data.frame(KEY = c(2, 5, 9, 18, 3))
> df3 <- data.frame(KEY = c(2, 5, 9, 18, 3), VALUE = c(3, 56, 3, 5, 3))
> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     3
4  14  18     5
> df2
  KEY
1   2
2   5
3   9
4  18
5   3
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     3
4  18     5
5   3     3
This solution assumes that KEY, MIN and MAX are integers, so we can create a sequence of keys and then join.
df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2,5,9,18,3))
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  group_by(VALUE, id = row_number()) %>%          # for each value and row id
  nest() %>%                                      # nest the remaining columns
  mutate(KEY = map(data, ~seq(.$MIN, .$MAX))) %>% # create a sequence of keys
  unnest(KEY) %>%                                 # unnest those keys
  right_join(df2, by = "KEY") %>%                 # join the other dataset
  select(KEY, VALUE)
# # A tibble: 5 x 2
# KEY VALUE
# <dbl> <dbl>
# 1 2.00 3.00
# 2 5.00 56.0
# 3 9.00 3.00
# 4 18.0 5.00
# 5 3.00 3.00
Or, group just by the row number and add VALUE in the map:
df1 %>%
  group_by(id = row_number()) %>%
  nest() %>%
  mutate(K = map(data, ~data.frame(VALUE = .$VALUE,
                                   KEY = seq(.$MIN, .$MAX)))) %>%
  unnest(K) %>%
  right_join(df2, by = "KEY") %>%
  select(KEY, VALUE)
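On recent dplyr (1.1.0 or later, which postdates this answer), the same range lookup can be written directly as a non-equi join, which avoids materializing one row per key in each range; a sketch using the same df1 and df2:

```r
library(dplyr)

df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2, 5, 9, 18, 3))

# match each KEY to the bucket whose [MIN, MAX] range contains it
out <- left_join(df2, df1, by = join_by(between(KEY, MIN, MAX))) %>%
  select(KEY, VALUE)
out
#   KEY VALUE
# 1   2     3
# 2   5    56
# 3   9     3
# 4  18     5
# 5   3     3
```

Unlike the sequence-based approach, this also works when keys and bounds are not integers.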
A very good and well-thought-out solution from @AntioniosK.
Here's a base R solution implemented as a general lookup function, taking as arguments a key dataframe and a bucket dataframe defined as in the question. The lookup ranges need not be unique or contiguous in this example, taking account of @Michael's comment that values may occur in more than one row (though normally such lookups would use unique ranges).
lookup = function(keydf, bucketdf){
  keydf$rowid = 1:nrow(keydf)
  merged = merge(bucketdf, keydf)  # avoid T as a name: it masks TRUE
  merged = merged[merged$KEY >= merged$MIN & merged$KEY <= merged$MAX, ]
  merged = merge(merged, keydf, all.y = TRUE)
  merged[order(merged$rowid), c("rowid", "KEY", "VALUE")]
}
The first merge is a Cartesian join of all rows in the key table to all rows in the bucket table. Such joins can be inefficient when the real tables are large, since joining x key rows to y bucket rows produces x * y intermediate rows; I doubt this would be a problem here unless x or y run into the thousands.
The second merge is done to recover any key values which are not matched to rows in the bucket list.
Using the example data as listed in @AntioniosK's post:
> lookup(df2, df1)
  rowid KEY VALUE
2     1   2     3
4     2   5    56
5     3   9     3
1     4  18     5
3     5   3     3
Using key and bucket exemplars that test edge cases (where the key = the min or the max), where a key value is not in the bucket list (the value 50 in df2A), and where there is a non-unique range (row 6 of df4 below):
df4 <- data.frame(MIN = c(1, 4, 8, 20, 30, 22), MAX = c(3, 6, 10, 25, 40, 24), VALUE = c(3, 56, 8, 10, 12, 23))
df2A <- data.frame(KEY = c(3, 6, 22, 30, 50))
> df4
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
4  20  25    10
5  30  40    12
6  22  24    23
> df2A
  KEY
1   3
2   6
3  22
4  30
5  50
> lookup(df2A, df4)
  rowid KEY VALUE
1     1   3     3
2     2   6    56
3     3  22    10
4     3  22    23
5     4  30    12
6     5  50    NA
As shown above, the lookup in this case returns two values for the non-unique ranges matching the key value 22, and NA for values in the key but not in the bucket list.