combining datasets with known identity variable - r

So lets take the following data
set.seed(123)
A <- 1:10
age<- sample(20:50,10)
height <- sample(100:210,10)
df1 <- data.frame(A, age, height)
B <- c(1,1,1,2,2,3,3,5,5,5,5,8,8,9,10,10)
injury <- sample(letters[1:5],16, replace=T)
df2 <- data.frame(B, injury)
Now, we can merge the data using the following code:
df3 <- merge(df1, df2, by.x = "A", by.y = "B", all=T)
head(df3)
# A age height injury
# 1 1 28 206 e
# 2 1 28 206 d
# 3 1 28 206 d
# 4 2 43 149 e
# 5 2 43 149 d
# 6 3 31 173 d
But what i want in the new data frame is the length of injury's as a level variable.
So the desired output should look like this:
So in this simple example we know that the max length of injury's is 4 per unique df2$B . So we need 4 new columns.
Must my data has an unknown number, so a code is needed to generate the correct, so something like
length(unique(df2$injury[df2$B]))
but that is also not correct syntax, as the output should equal 4

I don't know where the letters are coming from in your sample output, because there are none in the variables in your sample input, but you can try something like:
library(splitstackshape)
dcast.data.table(getanID(df3, c("A", "age")), A + age + height ~
.id, value.var = "injury")
## A age height 1 2 3 4
## 1: 1 28 206 4 3 3 NA
## 2: 2 43 149 4 3 NA NA
## 3: 3 31 173 3 3 NA NA
## 4: 4 44 161 NA NA NA NA
## 5: 5 45 111 3 2 1 4
## 6: 6 21 195 NA NA NA NA
## 7: 7 33 125 NA NA NA NA
## 8: 8 41 104 4 3 NA NA
## 9: 9 32 133 4 NA NA NA
## 10: 10 30 197 1 2 NA NA
This adds a secondary ID based on the first two columns and then spreads it to a wide format.

If you want to accomplish this using the tidyr package, I found it necessary to create an index variable:
df3 %>%
group_by(A) %>%
mutate(ind = row_number()) %>%
spread(ind, injury)

Related

How to combine/concatenate two dataframes one after the other but not merging common columns in R?

Suppose there are two dataframes as follows with same column names and I want to combine/concatenate one after the other without merging the common columns. There is a way of assigning it columnwise like df1[3]<-df2[1] but would like to know if there's some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
A B A.1 B.1
1 2 12 32
2 3 13 33
3 4 14 34
4 5 15 35
5 NA 16 36
6 NA 17 37
7 NA 18 38
8 NA 19 39
9 NA 20 40
10 NA NA NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows),]))
names(out) <- make.names(names(out), unique = TRUE)
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
One base R way could be
setNames(Reduce(cbind.data.frame,
Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
paste0(names(df1), rep(c('', '.1'), each=2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by -- "the name "row.names" or the number 0 specifies the row names"
all = TRUE -- keeps all original rows from both dataframes
suffixes -- specify how you want the duplicated colnames to be distinguished
sort -- keep original sorting
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
One way would be
cbind(
df1,
rbind(
df2,
rep(NA, nrow(df1) - nrow(df2))
)
)
`````

Wide format: a function to calculate row means for specific batches of columns, then scale up for multiple batches

This is a followup question to a previous post of mine about building a function for calculating row means.
I want to use any function of the apply family to iterate over my dataset and each time compute the row mean (which is what the function does) for a group of columns I specify. Unfortunately, I miss something critical in the way I should tweak apply(), because I get an error that I can't troubleshoot.
Example Data
capital_cities_df <-
data.frame("europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
> capital_cities_df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 NA
2 NA 12 22 NA 42 52
3 3 NA 23 33 43 NA
4 NA 14 NA NA NA NA
5 NA 15 25 35 45 NA
6 6 NA NA 36 NA 56
7 NA 17 NA NA NA 57
8 NA 18 NA 38 48 NA
9 NA 19 NA 39 49 NA
10 10 NA 30 40 NA 60
Custom Function
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
df %>%
select(starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
## works for a single case:
continent_mean(capital_cities_df, "europe")
europe_paris europe_london europe_rome europe
1 1 NA 21 11
2 2 12 22 12
3 3 NA 23 13
4 4 14 NA 9
5 NA 15 25 20
6 6 16 26 16
7 NA 17 NA 17
8 NA 18 NA 18
9 NA 19 NA 19
10 10 20 30 20
Trying to apply the function over the data, unsuccessfully
apply(
capital_cities_df,
MARGIN = 2,
FUN = continent_mean(capital_cities_df, continent = "europe")
)
Error in match.fun(FUN) :
'continent_mean(capital_cities_df, continent = "europe")' is not a function, character or symbol
Any other combination of the arguments in apply() didn't work either, nor did sapply. This unsuccessful attempt of using apply is only for one type of columns I wish to get the mean for ("europe"). However, my ultimate goal is to be able to pass c("europe", "asia", etc.) with apply, so I could get the custom function to create row means columns for all groups of columns I specify, in one hit.
What is wrong with my code?
Thanks!
EDIT 19-AUG-2019
I was trying the solution suggested by A. Suliman (see below). It did work for the example data I posted here, but not when trying to scale it up to my real dataset, where I need to subset additional columns (rather than the "continent" batch only). More specifically, in my real data I have an ID column which I want to get outputted along the other data, when I apply my custom-made function.
Example data including "ID" column
capital_cities_df <- data.frame(
"europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
id <- 1:10
capital_cities_df <- cbind(id, capital_cities_df)
> capital_cities_df
id europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 1 NA NA NA 41 NA
2 2 NA 12 22 NA 42 52
3 3 3 NA 23 33 43 NA
4 4 NA 14 NA NA NA NA
5 5 NA 15 25 35 45 NA
6 6 6 NA NA 36 NA 56
7 7 NA 17 NA NA NA 57
8 8 NA 18 NA 38 48 NA
9 9 NA 19 NA 39 49 NA
10 10 10 NA 30 40 NA 60
My function (edited to select id as well)
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
> continent_mean(capital_cities_df, "europe") ## works in a single run
id europe_paris europe_london europe_rome europe
1 1 1 NA NA 1.000000
2 2 NA 12 22 12.000000
3 3 3 NA 23 9.666667
4 4 NA 14 NA 9.000000
5 5 NA 15 25 15.000000
6 6 6 NA NA 6.000000
7 7 NA 17 NA 12.000000
8 8 NA 18 NA 13.000000
9 9 NA 19 NA 14.000000
10 10 10 NA 30 16.666667
Trying to apply the function beyond the single use (based on A. Suliman's method) -- unsuccessfully
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
## or:
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
In either case I get a variety of error messages:
Error in inds_combine(.vars, ind_list) : Position must be between 0
and n
At other times:
Error: invalid column index : NA for variable: 'NA' = 'NA'
All I wanted was a simple function to let me calculate row means per specification of which columns to run over, but this gets nasty for some reason. Even though I'm eager to figure out what's wrong with my code, if anybody has a better overarching solution for the entire process I'd be thankful too.
Thanks!
Use lapply to loop through continents then use grep to select columns with the current continent
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
#To a dataframe not a list
do.call(cbind, lst)
Using map_dfc from purrr we can get the result in one step
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
Update:
#grep will return column positions when they match with "europe" or "asia", e.g
> grep("europe", names(capital_cities_df))
[1] 2 3 4
#If we need the column names then we add value=TRUE to grep
> grep("europe", names(capital_cities_df), value = TRUE)
[1] "europe_paris" "europe_london" "europe_rome"
So to add a new column we can just use the c() function and call the function as usual
#NOTE: Here I'm using the old function without select
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, c('id',grep(x, names(capital_cities_df), value = TRUE))], continent=x))
do.call(cbind, lst)
id europe_paris europe_london europe_rome europe id asia_bangkok asia_tokyo asia_kathmandu asia
1 1 1 NA NA 1.00000 1 NA 41 51 31.00000
2 2 NA 12 22 12.00000 2 NA 42 52 32.00000
3 3 3 13 23 10.50000 3 33 43 NA 26.33333
4 4 NA 14 NA 9.00000 4 NA 44 54 34.00000
5 5 NA 15 25 15.00000 5 35 45 55 35.00000
6 6 6 NA NA 6.00000 6 36 46 56 36.00000
7 7 7 17 27 14.50000 7 NA 47 57 37.00000
8 8 NA 18 28 18.00000 8 38 48 NA 31.33333
9 9 9 19 29 16.50000 9 39 49 NA 32.33333
10 10 10 NA 30 16.66667 10 40 NA 60 36.66667
#We have one problem, id column gets duplicated, map_dfc with select will solve this issue
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, c('id',grep(.x, names(capital_cities_df), value = TRUE))], continent=.x)) %>%
#Don't select any column name ends with id followed by one digit
select(-matches('id\\d'))
If you'd like to use the new function with select then just pass capital_cities_df without grep, e.g using map_dfc
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df, continent=.x)) %>%
select(-matches('id\\d'))
Correction: in continent_mean
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
#Exclude id from the rowMeans calculation
dplyr::mutate(!!quo_name(continent) := rowMeans(.[grep(continent, names(.))], na.rm = TRUE))
}

common values of 2 columns in R

I have a file with 3 columns. the 1st column is ID, 2nd and 3rd are values for 2 conditions. in condition columns I have both - and + values. I would like to make 2 separate files. the 1st one would be for the negative values and the 2nd one would be for the positive values. do you know how to that in R?
Something like this ?
set.seed(1)
df1 <- data.frame(id=1:5,cond1 = sample(-100:100,5), cond2 = sample(-100:100,5))
df_neg <- df_pos <- df1
df_pos[,2:3][df1[,2:3]<0] <- NA # or 0, or NULL
df_neg[,2:3][df1[,2:3]>0] <- NA # or 0, or NULL
# > df1
# id cond1 cond2
# 1 1 -47 80
# 2 2 -26 88
# 3 3 13 31
# 4 4 79 24
# 5 5 -61 -88
# > df_pos
# id cond1 cond2
# 1 1 NA 80
# 2 2 NA 88
# 3 3 13 31
# 4 4 79 24
# 5 5 NA NA
# > df_neg
# id cond1 cond2
# 1 1 -47 NA
# 2 2 -26 NA
# 3 3 NA NA
# 4 4 NA NA
# 5 5 -61 -88

What is the difference between with and within in R?

I always use "with" instead of "within" within the context of my research, but I originally thought they were the same. Just now I mistype "with" for "within" and the results returned are quite different. I am wondering why?
I am using the baseball data in the plyr package, so I first load the library by
require(plyr)
Then, I want to select all rows with an id "ansonca01". At first, as I said, I used "within", and run the function as follows:
within(baseball, baseball[id=="ansonca01", ])
I got very strange results which basically includes everything:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
113 yorkto01 1871 1 TRO 29 145 36 37 5 7 2 23 2 2 9 1 NA NA NA NA NA
.........
Then I use "with" instead of "within",
with(baseball, baseball[id=="ansonca01",])
and got the results that I expected
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
121 ansonca01 1872 1 PH1 46 217 60 90 10 7 0 50 6 6 16 3 NA NA NA NA NA
276 ansonca01 1873 1 PH1 52 254 53 101 9 2 0 36 0 2 5 1 NA NA NA NA NA
398 ansonca01 1874 1 PH1 55 259 51 87 8 3 0 37 6 0 4 1 NA NA NA NA NA
525 ansonca01 1875 1 PH1 69 326 84 106 15 3 0 58 11 6 4 2 NA NA NA NA NA
I checked the documentation of with and within by typing help(with) in R environment, and got the following:
with is a generic function that evaluates expr in a local environment constructed from data. The environment has the caller's environment as its parent. This is useful for simplifying calls to modeling functions. (Note: if data is already an environment then this is used with its existing parent.)
Note that assignments within expr take place in the constructed environment and not in the user's workspace.
within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to data (this may fail in the data frame case if objects are created which cannot be stored in a data frame), and returns it. within can be used as an alternative to transform.
From this explanation of the differences, I don't get why I obtained different results with such a simple operation. Anyone has ideas?
I find simple examples often work to highlight the difference. Something like:
df <- data.frame(a=1:5,b=2:6)
df
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
with(df, {c <- a + b; df;} )
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
within(df, {c <- a + b; df;} )
# equivalent to: within(df, c <- a + b)
# i've just made the return of df explicit
# for comparison's sake
a b c
1 1 2 3
2 2 3 5
3 3 4 7
4 4 5 9
5 5 6 11
The documentation is quite clear about the semantics and return values (and nicely matches the everyday meanings of the words “with” and “within”):
Value:
For ‘with’, the value of the evaluated ‘expr’. For ‘within’, the
modified object.
Since your code doesn’t modify anything inside baseball, the unmodified baseball is returned. with on the other hand doesn’t return the object, it returns expr.
Here’s an example where the expression modifies the object:
> head(within(cars, speed[dist < 20] <- 1))
speed dist
1 1 2
2 1 10
3 1 4
4 7 22
5 1 16
6 1 10
As above, with returns the value of the last evaluated expression. It is handy for one-liners such as:
with(cars, summary(lm (speed ~ dist)))
but is not suitable for sending multiple expressions.
I often find within useful for manipulating a data.frame or list (or data.table) as I find the syntax easy to read.
I feel that the documentation could be improved by adding examples of use in this regard, e.g.:
df1 <- data.frame(a=1:3,
b=4:6,
c=letters[1:3])
## library("data.table")
## df1 <- as.data.table(df1)
df1 <- within(df1, {
a <- 10:12
b[1:2] <- letters[25:26]
c <- a
})
df1
giving
a b c
1: 10 y 10
2: 11 z 11
3: 12 6 12
and
df1 <- as.list(df1)
df1 <- within(df1, {
a <- 20:23
b[1:2] <- letters[25:26]
c <- paste0(a, b)
})
df1
giving
$a
[1] 20 21 22 23
$b
[1] "y" "z" "6"
$c
[1] "20y" "21z" "226" "23y"
Note also that methods("within") gives only these object types, being:
within.data.frame
within.list
(and within.data.table if the package is loaded).
Other packages may define additional methods.
Perhaps unexpectedly for some, with and within are generally not appropriate choices when manipulating variables within defined environments...
To address the comment - there is no within.environment method. Using with requires you to have the function you're calling within the environment, which somewhat defeats the purpose for me e.g.
df1 <- as.environment(df1)
## with(df1, ls()) ## Error
assign("ls", ls, envir=df1)
with(df1, ls())

Split a dataframe dynamically

I would like to have a function to split data frames like this:
q1 q2 q3 q4
1 4 0 33
8 5 33 44
na na na na
na na na na
3 33 2 66
4 2 3 88
6 44 5 99
We will get 2 dataframes:
d1
q1 q2 q3 q4
1 4 0 33
8 5 33 44
and
d2
3 33 2 66
4 2 3 88
6 44 5 99
The obs in d1 and d2 are not fixed. This means that we do not know the obs in the dataframe and how many obs are NAs.
Suppose DF is the data frame. Since it wasn't specified precisely what the splitting criterion is lets assume that any row with all NAs is a dividing row. If its some other criterion change the first line appropriately:
isNA <- apply(is.na(DF), 1, all)
split(DF[ !isNA, ], cumsum( isNA )[ !isNA ])
First, read in your data so that "na" gets converted to actual NA values.
mydf <- read.table(
header = TRUE,
na.strings="na",
text = "q1 q2 q3 q4
1 4 0 33
8 5 33 44
na na na na
3 33 2 66
4 2 3 88
6 44 5 99")
Second, figure out where to split your data.frame:
# Find the rows where *all* the values are `NA`
RLE <- rle(rowSums(is.na(mydf)) == ncol(mydf))$lengths
# Use that to create "groups" of rows
RLE2 <- rep(seq_along(RLE), RLE)
# Replace even numbered rows with NA -- we don't want them
RLE2[RLE2 %% 2 == 0] <- NA
Third, split your data.frame
split(mydf, RLE2)
# $`1`
# q1 q2 q3 q4
# 1 1 4 0 33
# 2 8 5 33 44
#
# $`3`
# q1 q2 q3 q4
# 4 3 33 2 66
# 5 4 2 3 88
# 6 6 44 5 99
However, this is all somewhat guesswork, because your statement that "This means that we do not know the obs in the dataframe and how many obs are NAs" is not really clear. Here, I've made the assumption that you want to split the data whenever you encounter a full row of NA values.

Resources