With the dataset df:
df
confint row Index
0.3407,0.4104 1 1
0.2849,0.4413 2 2
0.2137,0.2674 3 3
0.1910,0.4575 4 1
0.4039,0.4905 5 2
0.403,0.4822 6 3
0.0301,0.0646 7 1
0.0377,0.0747 8 2
0.0835,0.0918 9 3
0.0437,0.0829 10 1
0.0417,0.0711 11 2
0.0718,0.0798 12 3
0.0112,0.0417 13 1
0.019,0.0237 14 2
0.0213,0.0293 15 3
0.0121,0.0393 16 1
0.0126,0.0246 17 2
0.0318,0.0428 18 3
0.0298,0.0631 19 1
0.018,0.0202 20 2
0.1031,0.1207 21 3
This should be a fairly easy dataset to convert from long to wide form, giving a 7 (row) x 3 (column) dataframe. The result should have 3 columns named by Index and 7 rows (21/3 = 7). The code is as follows:
df <- spread(df, Index, confint, convert = FALSE)
However, using spread() I received the following error:
Error: Duplicate identifiers for rows (1, 4, 7, 10, 13, 16, 19), (2, 5, 8, 11, 14, 17, 20), (3, 6, 9, 12, 15, 18, 21)
Any help will be greatly appreciated!
We need to create a sequence column and then spread
library(tidyverse)
df %>%
  group_by(Index) %>%
  mutate(ind = row_number()) %>%
  spread(Index, confint, convert = FALSE)
NOTE: The duplicates would be an issue in the original dataset and not in the example data shown in the post.
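As an aside, in current tidyr spread() is superseded by pivot_wider(), which expresses the same fix once the sequence column exists. A minimal sketch on a hypothetical cut-down version of the data (confint and Index only):

```r
library(dplyr)
library(tidyr)

# hypothetical two-column version of the data: confint and Index only
df <- tibble(
  confint = c("0.3407,0.4104", "0.2849,0.4413", "0.2137,0.2674",
              "0.1910,0.4575", "0.4039,0.4905", "0.403,0.4822"),
  Index = rep(1:3, 2)
)

wide <- df %>%
  group_by(Index) %>%
  mutate(ind = row_number()) %>%  # sequence column to disambiguate duplicates
  ungroup() %>%
  pivot_wider(names_from = Index, values_from = confint)
wide
```

The sequence column `ind` plays the same role as in the spread() answer: it gives each (Index, row) combination a unique identifier.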
I am trying to find a tidyverse-based programmatic approach to calculating the number of variables meeting a condition in a dataset. The data contains a row for each individual and a variable containing a code that describes something about that individual. I need to efficiently create a tally of how many times that variable code meets multiple sets of criteria. My current process uses dplyr's mutate along with row-wise summing within a tidyverse pipeline to create the required tally.
Other similar posts answer the question by summing row-wise, as I already do. In practice, this approach results in a large amount of code and slow processing, since there are five variables, thousands of individuals, and a dozen criteria to tally separately.
Here is a demonstration of what I've tried so far. The desired output here is calculated as if the condition were for the code in the variables to match 20 or 24.
## Sample data and result
sample <- tibble(
  subjectNum = 1:10,
  var1 = c(20, 24, 20, 1, 24, 27, 7, 21, 20, 3),
  var2 = c(24, 20, 7, 19, 12, 8, 8, 10, 22, NA),
  var3 = c(NA, NA, 24, 20, NA, 20, 9, 3, 24, NA),
  desired_output = c(2, 2, 2, 1, 1, 1, 0, 0, 2, 0)
)
sample_calc <- sample %>%
  rowwise() %>%
  mutate(output = sum(var1 %in% c(20, 24), var2 %in% c(20, 24),
                      var3 %in% c(20, 24), na.rm = TRUE))
all(sample_calc$output == sample_calc$desired_output) # should return TRUE
The actual analysis requires conducting such a test for multiple sets of criteria that are available in a separate data file. It also requires the data structure to generally be maintained, so solutions using pivot_longer to count the variables fail as well.
We may use the vectorized rowSums() by looping across the columns that start with 'var', creating the logical condition within the loop, and applying rowSums() to the resulting logical columns. It should be more efficient than a rowwise() sum.
library(dplyr)
sample %>%
  mutate(output = rowSums(across(starts_with('var'),
                                 ~ .x %in% c(20, 24)), na.rm = TRUE))
Output:
# A tibble: 10 × 6
subjectNum var1 var2 var3 desired_output output
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20 24 NA 2 2
2 2 24 20 NA 2 2
3 3 20 7 24 2 2
4 4 1 19 20 1 1
5 5 24 12 NA 1 1
6 6 27 8 20 1 1
7 7 7 8 9 0 0
8 8 21 10 3 0 0
9 9 20 22 24 2 2
10 10 3 NA NA 0 0
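Since the real analysis applies several criteria sets stored in a separate file, the same rowSums() idea can be wrapped in a loop over the sets, producing one tally column per set. A sketch, where criteria is a hypothetical named list standing in for the external file:

```r
library(dplyr)

sample <- tibble(
  subjectNum = 1:3,
  var1 = c(20, 24, 7),
  var2 = c(24, 20, 8),
  var3 = c(NA, NA, 9)
)

# hypothetical criteria sets; each produces one tally column
criteria <- list(set_a = c(20, 24), set_b = c(7, 8))

for (nm in names(criteria)) {
  vars <- select(sample, starts_with("var"))
  # %in% returns FALSE for NA, so no na.rm is needed here
  sample[[nm]] <- rowSums(sapply(vars, function(x) x %in% criteria[[nm]]))
}
sample
```

This keeps the original data structure intact, adding only the tally columns, which matches the constraint that pivot_longer-style reshaping is not an option.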
I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A B
1 5.5
1 5.5
2 201
9 18
9 18
2 201
9 18
... ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
library(dplyr)

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 1
2 2 1
3 9 1
If we now alter df so that the pairing no longer holds:
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)

df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 2
2 2 1
3 9 1
Observe the increased count for group 1. As you have more than 10,000 rows, what remains is to check whether there is at least one instance with n_unique > 1, for instance with filter(n_unique > 1).
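The whole check can also be collapsed into one summarize step; a sketch using n_distinct(), which counts distinct values directly:

```r
library(dplyr)

# the altered data where the pairing is broken for A == 1
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)

# any rows returned are A values paired with more than one B value
mismatches <- df %>%
  group_by(A) %>%
  summarize(n_unique = n_distinct(B)) %>%
  filter(n_unique > 1)
mismatches
```

Zero rows returned would mean the two columns are perfectly paired.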
If you run this you will see how many unique values of B there are for each value of A:
tapply(df$B, df$A, function(x) length(unique(x)))
So if the max of this vector is 1 then there are no values of A that have more than one corresponding value of B.
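For example, with the paired data from above, the base R check collapses to a single logical (a sketch):

```r
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)

# TRUE when every value of A maps to exactly one value of B
paired <- all(tapply(df$B, df$A, function(x) length(unique(x))) == 1)
paired
```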
I have the following tibble (but in reality with many more rows): it is called education_tibble
library(tidyverse)
education_tibble <- tibble(
  ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
                 1011016, 1011021, 1011017, 1011019),
  hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9)
)
ghousecode hhc_educ
<dbl> <dbl>
1 1011027 2
2 1011017 0
3 1011021 11
4 1011019 10
5 1011025 14
6 1011017 4
7 1011016 8
8 1011021 16
9 1011017 0
10 1011019 9
I am trying to sum the hhc_educ so that each ghousecode has a corresponding "total hhc_educ". I am struggling to do this, and not sure what to do. I have been using the tidyverse, so have been exploring ways mostly within dplyr. Here is my code:
education_tibble %>%
  group_by(ghousecode, add = TRUE)
  summarize(total_educ = sum(hhc_educ))
The problem is that this code generates just one value for some reason, not a total_educ value for each group. Essentially I'm after a new tibble ultimately which will have each ghousecode in one row with the sum of all of the hhc_educ values next to it. Any help would be much appreciated! Thank you!
You missed a %>% I think.
library(tidyverse)
# data
education_tibble <- tibble(
  ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
                 1011016, 1011021, 1011017, 1011019),
  hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9)
)

# grouped count
education_tibble %>%
  group_by(ghousecode, add = TRUE) %>%
  summarise(total_educ = sum(hhc_educ))
Produces:
# A tibble: 6 x 2
ghousecode total_educ
<dbl> <dbl>
1 1011016 8
2 1011017 4
3 1011019 19
4 1011021 27
5 1011025 14
6 1011027 2
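Note that in recent dplyr versions the add argument of group_by() is deprecated in favour of .add; with a single grouping variable it can simply be dropped:

```r
library(dplyr)

education_tibble <- tibble(
  ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
                 1011016, 1011021, 1011017, 1011019),
  hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9)
)

# one row per ghousecode with the summed hhc_educ
totals <- education_tibble %>%
  group_by(ghousecode) %>%
  summarise(total_educ = sum(hhc_educ))
totals
```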
Here is an example piece of code:
df <- data.frame(c(14, 37, 15, 18, 1, 7))
df$rankk <- rank(-df)
and this is the result:
Rank
14 4
37 1
15 3
18 2
1 6
7 5
Now I would like the rows to be reordered according to their ranks as well.
Desired:
37 1
18 2
15 3
14 4
7 5
1 6
Thanks in advance
In base R you're looking for order():
df <- df[order(df$rankk), ]
In the tidyverse world you'd use arrange() (the %<>% assignment pipe comes from magrittr):
df %<>%
  arrange(rankk)
Or better yet skip creating the rank column at all,
df <- data.frame(x = c(14, 37, 15, 18, 1, 7))
# base R version
df <- df[order(-df$x), ]
# tidyverse version
df %<>%
  arrange(desc(x))
I have a dataframe with values defined per bucket (see df1 below).
Now I have another dataframe with values that fall within those buckets, for which I want to look up a value from the bucketed dataframe (see df2 below).
I would like to obtain the result df3 below.
df1 <- data.frame(MIN = c(1,4,8), MAX = c(3, 6, 10), VALUE = c(3, 56, 8))
df2 <- data.frame(KEY = c(2,5,9))
df3 <- data.frame(KEY = c(2,5,9), VALUE = c(3, 56, 8))
> df1
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 8
> df2
KEY
1 2
2 5
3 9
> df3
KEY VALUE
1 2 3
2 5 56
3 9 8
EDIT :
Extended the example.
> df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
> df2 <- data.frame(KEY = c(2,5,9,18,3))
> df3 <- data.frame(KEY = c(2,5,9,18,3), VALUE = c(3, 56, 3, 5, 3))
> df1
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 3
4 14 18 5
> df2
KEY
1 2
2 5
3 9
4 18
5 3
> df3
KEY VALUE
1 2 3
2 5 56
3 9 3
4 18 5
5 3 3
This solution assumes that KEY, MIN and MAX are integers, so we can create a sequence of keys and then join.
df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2,5,9,18,3))
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  group_by(VALUE, id = row_number()) %>%            # for each value and row id
  nest() %>%                                        # nest the remaining columns
  mutate(KEY = map(data, ~ seq(.$MIN, .$MAX))) %>%  # create a sequence of keys
  unnest(KEY) %>%                                   # unnest those keys
  right_join(df2, by = "KEY") %>%                   # join the other dataset
  select(KEY, VALUE)
# # A tibble: 5 x 2
# KEY VALUE
# <dbl> <dbl>
# 1 2.00 3.00
# 2 5.00 56.0
# 3 9.00 3.00
# 4 18.0 5.00
# 5 3.00 3.00
Or, group just by the row number and add VALUE in the map:
df1 %>%
  group_by(id = row_number()) %>%
  nest() %>%
  mutate(K = map(data, ~ data.frame(VALUE = .$VALUE,
                                    KEY = seq(.$MIN, .$MAX)))) %>%
  unnest(K) %>%
  right_join(df2, by = "KEY") %>%
  select(KEY, VALUE)
A very good and well-thought-out solution from @AntioniosK.
Here's a base R solution implemented as a general lookup function, taking as arguments a key dataframe and a bucket dataframe defined as in the question. The lookup values need not be unique or contiguous in this example, taking account of @Michael's comment that values may occur in more than one row (though normally such lookups would use unique ranges).
lookup <- function(keydf, bucketdf) {
  keydf$rowid <- 1:nrow(keydf)
  res <- merge(bucketdf, keydf)                       # Cartesian join
  res <- res[res$KEY >= res$MIN & res$KEY <= res$MAX, ]
  res <- merge(res, keydf, all.y = TRUE)              # recover unmatched keys
  res[order(res$rowid), c("rowid", "KEY", "VALUE")]
}
The first merge is a Cartesian join of all rows in the key to all rows in the bucket list. Such joins can be inefficient when the real tables are large, since joining x key rows to y bucket rows produces x*y rows; I doubt this would be a problem here unless x or y run into the thousands.
The second merge is done to recover any key values which are not matched to rows in the bucket list.
Using the example data as listed in @AntioniosK's post:
> lookup(df2, df1)
rowid KEY VALUE
2 1 2 3
4 2 5 56
5 3 9 3
1 4 18 5
3 5 3 3
Using key and bucket exemplars that test edge cases (where the key = the min or the max), where a key value is not in the bucket list (the value 50 in df2A), and where there is a non-unique range (row 6 of df4 below):
df4 <- data.frame(MIN = c(1,4,8, 20, 30, 22), MAX = c(3, 6, 10, 25, 40, 24), VALUE = c(3, 56, 8, 10, 12, 23))
df2A <- data.frame(KEY = c(3, 6, 22, 30, 50))
> df4
MIN MAX VALUE
1 1 3 3
2 4 6 56
3 8 10 8
4 20 25 10
5 30 40 12
6 22 24 23
> df2A
KEY
1 3
2 6
3 22
4 30
5 50
> lookup(df2A, df4)
rowid KEY VALUE
1 1 3 3
2 2 6 56
3 3 22 10
4 3 22 23
5 4 30 12
6 5 50 NA
As shown above, the lookup in this case returns two values for the non-unique ranges matching the key value 22, and NA for values in the key but not in the bucket list.
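When the ranges are sorted by MIN and do not overlap, base R's findInterval() gives a fully vectorized alternative that avoids the Cartesian join entirely. A sketch under those assumptions, using the extended example data:

```r
df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18),
                  VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2, 5, 9, 18, 3))

idx <- findInterval(df2$KEY, df1$MIN)  # candidate bucket for each key
idx[idx == 0] <- NA_integer_           # keys below the first MIN match nothing
# keep the candidate only if the key also falls at or below that bucket's MAX
df2$VALUE <- ifelse(df2$KEY <= df1$MAX[idx], df1$VALUE[idx], NA)
df2
```

Keys that fall in a gap between buckets, or outside them entirely, come back as NA, matching the behaviour of the merge-based lookup. It does not, however, handle the overlapping-range case above, where a key can legitimately match two buckets.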