R Programming - Combine Data Columnwise

I have two data sets with the same dimensions, and I want to combine them so that the 1st column of the second data set sits next to the 1st column of the first data set, and so on.
Consider the example below, which shows the expected output. Here, v1 comes from data set 1 and v2 comes from data set 2. I also want to keep the column headers as they are.
| v1 | v2 |
|:------:|:------:|
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
I tried cbind() and data.frame(), but both appended the second data set after the full first data set rather than interleaving the columns.
> dim(firstDataSet)
[1] 100 200
> dim(secondDataSet)
[1] 100 200
> finalDataSet_cbind <- cbind(firstDataSet, secondDataSet)
> dim(finalDataSet_cbind)
[1] 100 400
> finalDataSet_dframe <- data.frame(firstDataSet, secondDataSet)
> dim(finalDataSet_dframe)
[1] 100 400
Please suggest correct and better ways to achieve this, thanks.
UPDATE: Response to the possible-duplicate flag on this question:
That answer didn't work out for me. The data I get after following that solution is not what I want; it is similar to the output of the cbind() approach described above.
The first answer given here works for me, apart from a small issue: a new column name is assigned to each column instead of keeping the original column headers.
Also, I don't have enough reputation to comment on the accepted answer.

Probably not the most efficient solution, since it uses a for loop, but it works:
data1 <- cbind(1:10, 11:20, 21:30)
data2 <- cbind(1:10, 11:20, 21:30)
combined <- NULL
for(i in 1:ncol(data1)){
  combined <- cbind(combined, data1[, i], data2[, i])
}

To fix the column name requirement, you could do the following. Basically, you first cbind the two data frames, then create an index that puts the columns in the interleaved order. Using that index, you also build a vector of the correct column names. You then reorder the columns by that index and reassign the column names.
df1 <- df2 <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
final <- cbind(df1, df2)
# interleaved column positions: 1, 4, 2, 5, 3, 6
indexed <- rep(1:ncol(df1), each = 2) + (0:1) * ncol(df1)
new_colnames <- colnames(final)[indexed]
final_ordered <- final[indexed]
colnames(final_ordered) <- new_colnames
final_ordered
v1 v1 v2 v2 v3 v3
1 1 1 11 11 21 21
2 2 2 12 12 22 22
3 3 3 13 13 23 23
4 4 4 14 14 24 24
5 5 5 15 15 25 25
6 6 6 16 16 26 26
7 7 7 17 17 27 27
8 8 8 18 18 28 28
9 9 9 19 19 29 29
10 10 10 20 20 30 30
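Applied to the data sets in the question, the same indexing idea would look like this (just a sketch; it assumes firstDataSet and secondDataSet are the 100 x 200 objects from the question and that their columns are named):
final <- cbind(firstDataSet, secondDataSet)
# interleaved positions: 1, 201, 2, 202, ... so paired columns sit side by side
idx <- rep(seq_len(ncol(firstDataSet)), each = 2) + c(0, ncol(firstDataSet))
finalDataSet <- final[, idx]
colnames(finalDataSet) <- colnames(final)[idx]
dim(finalDataSet)  # 100 400, columns alternating between the two data sets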

Related

Generating random sample data in R with specified sample size and probability

I want to use R to write a model that will answer a general question about probability. The general question is below, followed by my specific questions about how to answer it using R code. If you know the answer to the general question (separate from the R code), and can explain the underlying statistical principles in plain English, I'm interested in that too!
Question: If I split a group of n objects, first through a 4-way splitter, then through a 7-way splitter (resulting in a total of 28 distinct groups), and each splitter results in a random distribution (i.e. the objects are split approximately equally), does the order of the splits impact the variance of the final 28 groups? If I split into 4 and then into 7, is that different from splitting into 7 and then into 4? Does the answer change if one splitter has greater variance than the other?
Specific R question: how can I write a model to answer this question? So far, I've tried using sample and rnorm to generate sample data. Simulating a 4-way splitter would look something like this:
sample(1:4, size=100000, replace=TRUE)
This is basically like rolling a 4-sided die 100,000 times and recording the number of instances of each number. I can use the table function to sum the instances, which gives me an output like this:
> table(sample(1:4, size=100000, replace=TRUE))
1 2 3 4
25222 24790 25047 24941
Now, I want to take each of those outputs and use them as the input for a 7-way split.
I tried saving the 4-way split as a variable and then plugging that vector into the size = argument like this:
Split4way <- as.vector(table(sample(1:4, size=100000, replace=TRUE)))
as.vector(table(sample(1:7, size=Split4way, replace=TRUE)))
But when I do that, instead of a matrix with 4 rows and 7 columns, I just get a vector with 1 row and 7 columns. It appears that the "size" argument for the 7-way split only uses 1 of the 4 outputs from the 4-way split instead of using each of them.
> as.vector(table(sample(1:7, size = Split4way, replace=TRUE)))
[1] 3527 3570 3527 3511 3550 3480 3588
So, how can I generate a table or list that shows all the outputs of the 4-way split followed by the 7-way split, for a total of 28 splits?
AND
Is there a function that will allow me to customize the standard deviation of each splitting device? For example, can I dictate that the outputs of the 4-way splitter have a standard deviation of x%, and the outputs of the 7-way splitter have a standard deviation of x%?
We can illustrate your set-up by writing a function that will simulate n objects being passed into the splitters.
Imagine the object comes first to the 4-splitter. Let us randomly assign it a number from one to four to determine which way it is split. Next it comes to a seven splitter; we can also randomly assign it a number from one to seven to determine which final bin it will end up in.
The set up looks like this:
Final bins
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7
\__|__|__|__|__|_/ \__|__|__|__|__|_/ \__|__|__|__|__|_/ \__|__|__|__|__|_/
| | | |
seven splitter seven splitter seven splitter seven splitter
| | | |
1 2 3 4
\___________________|____________________|___________________/
|
four splitter
|
input
We can see that any unique pair of numbers will cause the object to end up in a different bin.
For the second set-up, we reverse the order, so that the seven splitter comes first, but otherwise each object still gets a unique bin based on a unique pair of numbers:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
\__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/
| | | | | | |
4 splitter 4 splitter 4 splitter 4 splitter 4 splitter 4 splitter 4 splitter
| | | | | | |
1 2 3 4 5 6 7
\__________|___________|___________|___________|___________|__________/
|
7 splitter
|
input
Note that we can either draw a random 1:4 then a random 1:7, or vice versa, but in either case the unique pair will determine a unique bin. The actual bin the object ends up in will change depending on the order in which the two numbers are applied, but this will not change the fact that each bin will get 1/28 of the objects passed in, and the variance will remain the same.
That means to simulate and compare the two set ups, we need only sample from 1:4 and 1:7 for each object passed in, then apply the two numbers in a different order to calculate the final bin:
simulate <- function(n) {
  df <- data.frame(fours = sample(4, n, replace = TRUE),
                   sevens = sample(7, n, replace = TRUE))
  df$four_then_seven <- 7 * (df$fours - 1) + df$sevens
  df$seven_then_four <- 4 * (df$sevens - 1) + df$fours
  return(df)
}
So let's examine how this would play out for 10 objects passed in:
set.seed(69) # Makes the example reproducible
simulate(10)
#> fours sevens four_then_seven seven_then_four
#> 1 4 6 27 24
#> 2 1 5 5 17
#> 3 3 7 21 27
#> 4 2 2 9 6
#> 5 4 2 23 8
#> 6 4 3 24 12
#> 7 1 4 4 13
#> 8 3 2 16 7
#> 9 3 7 21 27
#> 10 3 2 16 7
Now let's do a table of the quantities in each bin if we had 100,000 draws:
s <- simulate(100000)
seven_four <- table(s$seven_then_four)
seven_four
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 3434 3607 3539 3447 3512 3628 3564 3522 3540 3539 3544 3524 3552 3644 3626 3578
#> 17 18 19 20 21 22 23 24 25 26 27 28
#> 3609 3616 3673 3617 3654 3637 3542 3624 3568 3651 3486 3523
four_seven <- table(s$four_then_seven)
four_seven
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 3434 3512 3540 3552 3609 3654 3568 3607 3628 3539 3644 3616 3637 3651 3539 3564
#> 17 18 19 20 21 22 23 24 25 26 27 28
#> 3544 3626 3673 3542 3486 3447 3522 3524 3578 3617 3624 3523
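If you would rather see the 4 x 7 matrix layout described in the question (each 4-way group split seven ways), the same simulated draws can be cross-tabulated directly; a one-line sketch using the s data frame from above:
table(s$fours, s$sevens)  # 4 rows (first split) by 7 columns (second split): the 28 bins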
If you sort these two tables from smallest number to largest number in each bin, you will see they are actually identical apart from the labels on their bins. The distribution of counts is completely unchanged. This means the variance / standard deviation is also the same in both cases:
var(four_seven)
#> [1] 3931.439
var(seven_four)
#> [1] 3931.439
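As a quick check of the claim that the two tables are identical apart from the bin labels, a small sketch using the objects above:
all(sort(as.integer(four_seven)) == sort(as.integer(seven_four)))
#> [1] TRUE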
The only way to change the variance / standard deviation is to "fix" the splitters so they do not have an equal probability.
I'm also struggling to interpret your use of variance and standard deviation. The best I can think of is to do this "splitting" non-uniformly.
As an alternative to Allan's code, you could generate non-uniform samples by doing:
# how should the alternatives be weighted (normalised probability is also OK)
a <- c(1, 2, 3, 4) # i.e. last four times as much as first
b <- c(1, 1, 2, 2, 3, 3, 4)
x <- sample(28, 10000, prob=a %*% t(b), replace=TRUE)
Note that prob is automatically normalised (i.e. divided by its sum) inside sample. You can check that things are working with:
table((x-1) %% 4 + 1) should be close to a/sum(a) * 10000
table((x-1) %/% 4 + 1) should be close to b/sum(b) * 10000
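For a concrete check of those two statements, a small sketch using the a, b and x objects from above:
# 4-way margin: observed counts vs. expected counts
table((x - 1) %% 4 + 1)
a / sum(a) * 10000
# 7-way margin: observed counts vs. expected counts
table((x - 1) %/% 4 + 1)
b / sum(b) * 10000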

Inserting missing records into one data frame from another data frame of a different size - vectorized solution?

I'll start by saying that the question "filling in missing data in one data frame with info from another" has one solution that may work for my problem. However, it solves it with a for loop, and I would prefer a vectorized solution.
I have 125 years of climate data with year, month, temperature, precipitation, and open pan evaporation. It is daily data summarized by month. Some years in the late 1800's have entire months missing, and I would like to substitute those missing months with its equivalent month from a 30-year average around that time.
I have pasted some of the code I've been playing with, below:
# For simplicity, let's pretend there are 5 months in the year, so year 3
# is the only year with a complete set of data, years 1 and 2 are missing some.
df1 <- structure(
  list(
    Year   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
    Month  = c(1, 2, 4, 2, 5, 1, 2, 3, 4, 5),
    Temp   = c(-2, 2, 10, -4, 12, 2, 4, 8, 14, 16),
    Precip = c(20, 10, 50, 10, 60, 26, 18, 40, 60, 46),
    Evap   = c(2, 6, 30, 4, 48, 4, 10, 32, 70, 40)
  )
)
# This represents the 30-year average data:
df2 <- structure(
  list(
    Month  = c(1, 2, 3, 4, 5),
    Temp   = c(1, 3, 9, 13, 15),
    Precip = c(11, 13, 21, 43, 35),
    Evap   = c(1, 5, 13, 35, 45)
  )
)
# to match my actual setup (as_tibble() comes from the tibble package)
library(tibble)
df1 <- as_tibble(df1)
df2 <- as_tibble(df2)
# I can get to the list of months missing from a given year
full_year <- df2[,1]
compare_year1 <- df1[df1$Year==1,2]
missing_months <- setdiff(full_year,compare_year1)
# Or I can get the full data from each year missing one or more months
year_full <- df2[,1]
years_compare <- split(df1[,c(2)], df1$Year)
years_missing_months <- names(years_compare[sapply(years_compare,nrow)<5])
complete_years_missing_months <- df1[df1$Year %in% years_missing_months,]
This is where I've gotten stumped.
I've looked at anti_join and merge, but it looks like they need data of the same length in each frame. I can get from lists grouped by year to identify the years that are missing months, but I'm not sure how to actually get the rows inserted from there. It seems like lapply could be useful, but the answer ain't comin'.
Thanks in advance.
Edit 7/19: As an illustration of what I need, just looking at year "1", the current data (df1) has the following:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 4 | 10 | 50 | 30
Months 3 and 5 are missing data, so I would like to insert the equivalent-month data from the 30-year average table (df2), so the final result for year "1" would look like:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 3 | 9 | 21 | 13
1 | 4 | 10 | 50 | 30
1 | 5 | 15 | 35 | 45
Then fill in every year missing months in like manner. Year "3" would have no change, because (in this 5-month example) there are no months missing data.
First just add rows to hold the imputed values, since you know that there are missing rows with known dates:
# give each row a real date so the months that should exist can be enumerated
# (years 1-3 are mapped to 2001-2003 just to build valid dates)
df1$date <- as.Date(paste0("200", df1$Year, "/", df1$Month, "/01"))
pretend_12months <- seq(min(df1$date), max(df1$date), by = "1 month")
pretend_5months <- pretend_12months[lubridate::month(pretend_12months) < 6]
pretend_5months <- data.frame(date = pretend_5months)
# a full outer merge adds an all-NA row for every missing month
new <- merge(df1,
             pretend_5months,
             by = "date",
             all = TRUE)
# fill Year and Month back in for the newly added rows
new$Year <- ifelse(is.na(new$Year),
                   substr(lubridate::year(new$date), 4, 4),
                   new$Year)
new$Month <- ifelse(is.na(new$Month),
                    lubridate::month(new$date),
                    new$Month)
Impute the NA values using a left join:
# key part: left join using any library or builtin method (left_join,merge, etc)
fillin <- sqldf::sqldf("select a.date,a.Year,a.Month, b.Temp, b.Precip, b.Evap from new a left join df2 b on a.Month = b.Month")
# apply data set from join to the NA data
new$Temp[is.na(new$Temp)] <- fillin$Temp[is.na(new$Temp)]
new$Precip[is.na(new$Precip)] <- fillin$Precip[is.na(new$Precip)]
new$Evap[is.na(new$Evap)] <- fillin$Evap[is.na(new$Evap)]
date Year Month Temp Precip Evap
1 2001-01-01 1 1 -2 20 2
2 2001-02-01 1 2 2 10 6
3 2001-03-01 1 3 9 21 13
4 2001-04-01 1 4 10 50 30
5 2001-05-01 1 5 15 35 45
6 2002-01-01 2 1 1 11 1
7 2002-02-01 2 2 -4 10 4
8 2002-03-01 2 3 9 21 13
9 2002-04-01 2 4 13 43 35
10 2002-05-01 2 5 12 60 48
11 2003-01-01 3 1 2 26 4
12 2003-02-01 3 2 4 18 10
13 2003-03-01 3 3 8 40 32
14 2003-04-01 3 4 14 60 70
15 2003-05-01 3 5 16 46 40
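If you prefer a fully vectorised route, here is a tidyverse sketch of the same imputation (assuming df1 and df2 as defined in the question, already converted with as_tibble()): complete() adds a row for every missing Year/Month combination, and coalesce() fills the NA values from the 30-year averages.
library(dplyr)
library(tidyr)
filled <- df1 %>%
  complete(Year, Month = df2$Month) %>%                   # add rows for the missing months
  left_join(df2, by = "Month", suffix = c("", ".avg")) %>%
  mutate(Temp   = coalesce(Temp, Temp.avg),
         Precip = coalesce(Precip, Precip.avg),
         Evap   = coalesce(Evap, Evap.avg)) %>%
  select(Year, Month, Temp, Precip, Evap)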

Searching a vector/data table backwards in R

Basically, I have a very large data frame/data table, and I would like to search a column for the nearest NA value whose index is less than my current index position.
For example, let's say I have a data frame DF as follows:
INDEX | KEY | ITEM
----------------------
1 | 10 | AAA
2 | 12 | AAA
3 | NA | AAA
4 | 18 | AAA
5 | NA | AAA
6 | 24 | AAA
7 | 29 | AAA
8 | 31 | AAA
9 | 34 | AAA
From this data frame we have an NA value at index 3 and at index 5. Now, let's say we start at index 8 (which has KEY of 31). I would like to search the column KEY backwards such that the moment it finds the first instance of NA the search stops, and the index of the NA value is returned.
I know there are ways to find all NA values in a vector/column (for example, I can use which(is.na(x)) to return the indices that are NA), but due to the sheer size of the data frame I am working with, and the large number of iterations that need to be performed, this is a very inefficient way of doing it. One method I thought of is a kind of "do while" loop, and it does seem to work, but this again seems quite inefficient since it needs to perform the calculations each time (and given that I need to do over 100,000 iterations, this does not look like a good idea).
Is there a fast way of searching a column backwards from a particular index such that I can find the index of the closest NA value?
Why not do a forward-fill of the NA indexes once, so that you can then look up the most recent NA for any row in future:
library(dplyr)
library(tidyr)
df = df %>%
mutate(last_missing = if_else(is.na(KEY), INDEX, as.integer(NA))) %>%
fill(last_missing)
Output:
> df
INDEX KEY ITEM last_missing
1 1 10 AAA NA
2 2 12 AAA NA
3 3 NA AAA 3
4 4 18 AAA 3
5 5 NA AAA 5
6 6 24 AAA 5
7 7 29 AAA 5
8 8 31 AAA 5
9 9 34 AAA 5
Now there's no need to recalculate every time you need the answer for a given row. There may be more efficient ways to do the forward fill, but I think exploring those is easier than figuring out how to optimise the backward search.
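If you would rather stay in base R, the same one-off forward fill can be done with findInterval; a small sketch assuming the same df with INDEX and KEY columns:
na_idx <- which(is.na(df$KEY))  # positions of the NA values
# for each row, the index of the most recent NA at or before it (NA if none yet)
df$last_missing <- c(NA, na_idx)[findInterval(seq_len(nrow(df)), na_idx) + 1]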

R creating a cluster based on overlapping /intersecting rows

I have the following data frame in R that has overlapping data in the two columns a_sno and b_sno
a_sno<- c(4,5,5,6,6,7,9,9,10,10,10,11,13,13,13,14,14,15,21,21,21,22,23,23,24,25,183,184,185,185,200)
b_sno<-c(5,4,6,5,7,6,10,13,9,13,14,15,9,10,14,10,13,11,22,23,24,21,21,25,21,23,185,185,183,184,200)
df = data.frame(a_sno, b_sno)
If you take a close look at the data you can see that 4, 5, 6 and 7 intersect/overlap, and I need to put them into a group called 1.
Likewise 9, 10, 13 and 14 go into group 2, 11 and 15 into group 3, etc. 200 does not intersect with any other row but still needs to be assigned its own group.
The resulting output should look like this:
---------
group|sno
---------
1 | 4
1 | 5
1 | 6
1 | 7
2 | 9
2 | 10
2 | 13
2 | 14
3 | 11
3 | 15
4 | 21
4 | 22
4 | 23
4 | 24
4 | 25
5 | 183
5 | 184
5 | 185
6 | 200
Any help to get this done is much appreciated. Thanks
Probably not the most efficient solution but you could use graphs to do this:
#sort each row and remove duplicate edges
df <- unique(t(apply(df, 1, sort)))
#load the library
library(igraph)
#make a graph with your data
graph <- graph.data.frame(df)
#decompose it into connected components
components <- decompose.graph(graph)
#get the vertices of the subgraphs
result <- lapply(seq_along(components), function(i){
  vertex <- as.numeric(V(components[[i]])$name)
  cbind(rep(i, length(vertex)), vertex)
})
#make the final dataframe
output <- as.data.frame(do.call(rbind, result))
colnames(output) <- c("group", "sno")
output
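A slightly shorter variant of the same idea, sketched with igraph's components(), which returns the component membership of every vertex directly (this reuses the graph object built above):
memb <- components(graph)$membership  # named vector: vertex name -> component id
output2 <- data.frame(group = as.integer(memb),
                      sno = as.numeric(names(memb)))
output2[order(output2$group, output2$sno), ]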

Subtract pairs of columns based on matching column

I'll apologise in advance - I know this has likely been answered elsewhere, but I don't seem to be able to find the answer I need, and can't manage to adapt other code I have found to my needs.
I have a data frame:
FILE | TECHNIQUE | COUNT
------------------------
A | ONE | 10
A | TWO | 25
B | ONE | 5
B | TWO | 30
C | ONE | 30
C | TWO | 50
I would like to produce a data frame of the difference of the COUNT values between ONE and TWO, with a row for each FILE, i.e.
FILE | DIFFERENCE
-----------------
A | 15
B | 25
C | 20
I'm convinced I should be able to do this fairly easily with base R or Plyr, but am a bit stuck. Could anyone suggest a good way to do this, and perhaps good tutorials on Plyr that might help me with similar problems in the future?
Thanks
Using aggregate in base:
> aggregate(.~FILE, data= DF[, -2], FUN=diff)
FILE COUNT
1 A 15
2 B 25
3 C 20
Using ddply in plyr
> # library(plyr)
> ddply(DF[,-2], .(FILE), summarize, DIFFERENCE=diff(COUNT))
FILE DIFFERENCE
1 A 15
2 B 25
3 C 20
with data.table
> # library(data.table)
> DT <- data.table(DF)
> DT[, diff(COUNT), by=FILE]
FILE V1
1: A 15
2: B 25
3: C 20
with by
> with(DF, by(COUNT, FILE, diff))
FILE: A
[1] 15
-----------------------------------------------------------------------------
FILE: B
[1] 25
-----------------------------------------------------------------------------
FILE: C
[1] 20
with tapply
> tapply(DF$COUNT, DF$FILE, diff)
A B C
15 25 20
with summaryBy from doBy package
> # library(doBy)
> summaryBy(COUNT~FILE, FUN=diff, data=DF)
FILE COUNT.diff
1 A 15
2 B 25
3 C 20
Update
As a percentage (here ONE as a percentage of TWO for each FILE):
> aggregate(.~FILE, data= DF[, -2], function(x) (x[1]/x[2])*100)
FILE COUNT
1 A 40.00000
2 B 16.66667
3 C 60.00000
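For completeness, a dplyr version of the same grouped difference (a sketch, assuming the same DF):
# library(dplyr)
DF %>%
  group_by(FILE) %>%
  summarise(DIFFERENCE = diff(COUNT))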
