Combining two data frames with new column(s) - r

I have two similar tables, each recording someone's spending over 3 months.
For months 4-6 a new variable has been added.
df1 = data.frame(Month=c(1,2,3),Rent=c(132,123,234),Food=c(34,13,45))
df2 = data.frame(Month=c(4,5,6),Rent=c(111,212,231),Food=c(33,11,41),Fun=c(4,6,5))
> df1
Month Rent Food
1 1 132 34
2 2 123 13
3 3 234 45
> df2
Month Rent Food Fun
1 4 111 33 4
2 5 212 11 6
3 6 231 41 5
How can I combine/merge the two tables to look like this:
Month Rent Food Fun
1 1 132 34 NA
2 2 123 13 NA
3 3 234 45 NA
4 4 111 33 4
5 5 212 11 6
6 6 231 41 5

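You can use the join family of functions in the dplyr package for such tasks. full_join() joins on all shared columns (Month, Rent, Food); no rows match here, so every row from both tables is kept and Fun is filled with NA for the rows from df1: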
library(dplyr)
full_join(df1, df2)
Joining by: c("Month", "Rent", "Food")
Month Rent Food Fun
1 1 132 34 NA
2 2 123 13 NA
3 3 234 45 NA
4 4 111 33 4
5 5 212 11 6
6 6 231 41 5
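If you would rather not depend on a join, the same stacking can be sketched in base R. This is only an illustration under the assumption that df2's columns are a superset of df1's; the names missing_cols, df1_padded and combined are made up for the example.

```r
df1 <- data.frame(Month = c(1, 2, 3), Rent = c(132, 123, 234), Food = c(34, 13, 45))
df2 <- data.frame(Month = c(4, 5, 6), Rent = c(111, 212, 231), Food = c(33, 11, 41),
                  Fun = c(4, 6, 5))

# Columns present in df2 but absent from df1 (here: "Fun").
missing_cols <- setdiff(names(df2), names(df1))

# Add those columns to a copy of df1, filled with NA.
df1_padded <- df1
df1_padded[missing_cols] <- NA

# Stack the rows, putting df2's columns in the same order first.
combined <- rbind(df1_padded, df2[names(df1_padded)])
combined
```

With dplyr loaded, bind_rows(df1, df2) should give the same table, since it also fills columns missing from one input with NA.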

Related

Reading in space separated dataset in R for frequent items

I have a .txt file that consists of numbers separated by spaces. Each row has a different amount of numbers in it. I need to do market basket analysis on the data, however I can't seem to properly load the data (especially because there is a different number of items in each 'basket'). What is the best way to store the data so I can find the frequent items and then check for frequent items in each basket?
Example of data:
1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54
You should be able to read the file with readLines and then use lapply to convert each line to numerics. Assume the data is in a file named txt.txt:
dat <- lapply( readLines("txt.txt"), function(Line) scan(text=Line) )
The reason I didn't suggest read.table with fill=TRUE (which would give you something similar to the other answer that has appeared) is that the column structure is not needed unless there is information encoded in the position of those numbers. I'm wondering whether there might be additional information encoded in the individual lines, such as regions or stores or some other entity as the source of particular numbered items. That would be the reason for keeping the data in a list structure with uneven lengths. You can get a global count of each item with table:
table( unlist(dat) )
1 2 3 4 5 6 7 8 21 23 32 34 43 54 67 145 154 456
1 4 2 3 2 1 1 1 2 1 2 1 1 1 1 1 1 1
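To go from those counts to "which frequent items are in each basket", something like the following might work. It is a rough sketch: the lines are typed in directly instead of read from txt.txt so that it runs on its own, and the threshold min_count = 3 is an arbitrary assumption, not something from the question.

```r
# The sample lines from the question, inlined so the example is self-contained.
lines <- c("1 2 4 3 67 43 154",
           "4 5 3 21 2",
           "2 4 5 32 145",
           "2 6 7 8 23 456 32 21 34 54")

# One numeric vector per basket, kept in a list because lengths differ.
baskets <- lapply(strsplit(lines, " +"), as.numeric)

# Global item counts across all baskets.
counts <- table(unlist(baskets))

# Items that appear at least min_count times (hypothetical threshold).
min_count <- 3
frequent <- as.numeric(names(counts)[counts >= min_count])

# For each basket, the frequent items it contains.
in_basket <- lapply(baskets, intersect, frequent)
```

For real market basket analysis a dedicated package is probably a better fit, but the list structure above is enough for simple counting.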
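Alternatively, to get a rectangular data frame padded with NA (using dplyr and tidyr):
library(dplyr)  # provides %>%
library(tidyr)  # provides separate()
my_text = '1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54'
# split into one string per line and strip surrounding whitespace
my_text2 <- strsplit(my_text, split = '\n')
my_text2 <- lapply(my_text2, trimws)
# one row per line, then split each line into up to 10 columns
my_text2 %>%
  do.call('rbind', .) %>%
  t %>%
  as.data.frame() %>%
  separate(V1, sep = ' ', into = paste('col_', 1:10))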
col_ 1 col_ 2 col_ 3 col_ 4 col_ 5 col_ 6 col_ 7 col_ 8 col_ 9 col_ 10
1 1 2 4 3 67 43 154 <NA> <NA> <NA>
2 4 5 3 21 2 <NA> <NA> <NA> <NA> <NA>
3 2 4 5 32 145 <NA> <NA> <NA> <NA> <NA>
4 2 6 7 8 23 456 32 21 34 54

How to extract all the columns from a data frame based on a column in another data frame?

I have two data frames. I want to extract all the columns from a data frame based on another data frame column.
df1:
sample
GY
AP
A9
MB
AU
df2:
num start end length GY A9 MB AP JK GH AU
2 23 24 567 5 6 7 8 9 0 1
2 3 44 57 8 6 7 3 4 0 9
2 234 54 67 5 6 7 8 9 0 1
result:
num start end length GY A9 MB AP AU
2 23 24 567 5 6 7 8 1
2 3 44 57 8 6 7 3 9
2 234 54 67 5 6 7 8 1
I tried in this way but it didn't work out:
u <- df1[df1$sample %in% colnames(df2),]
Can anyone tell me how to do this?
With:
df2[, c(1:4, which(colnames(df2) %in% df1$sample))]
you get:
num start end length GY A9 MB AP AU
1 2 23 24 567 5 6 7 8 1
2 2 3 44 57 8 6 7 3 9
3 2 234 54 67 5 6 7 8 1
And this also works:
df2[, c(rep(TRUE,4), tail(colnames(df2) %in% df1$sample, -4))]
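If you prefer to select by name rather than by position, a variant along these lines should also work. It is a sketch that rebuilds the sample data so it runs on its own; id_cols and keep are names chosen just for this example.

```r
df1 <- data.frame(sample = c("GY", "AP", "A9", "MB", "AU"))
df2 <- data.frame(num = c(2, 2, 2), start = c(23, 3, 234), end = c(24, 44, 54),
                  length = c(567, 57, 67), GY = c(5, 8, 5), A9 = c(6, 6, 6),
                  MB = c(7, 7, 7), AP = c(8, 3, 8), JK = c(9, 4, 9),
                  GH = c(0, 0, 0), AU = c(1, 9, 1))

# Columns that are always kept, named explicitly instead of using 1:4.
id_cols <- c("num", "start", "end", "length")

# Sample columns present in both df2 and df1$sample, in df2's column order.
keep <- c(id_cols, intersect(colnames(df2), df1$sample))

result <- df2[, keep]
result
```

Naming the fixed columns means the selection does not break if the column order of df2 changes. Swapping the arguments, intersect(df1$sample, colnames(df2)), would return the columns in the order listed in df1 instead.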

Aggregation of all possible unique combinations with observations in the same column in R

I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is for R to find every unique combination of two values (observations) in column "obs", within the same year, and to create a new matrix or data frame whose observations are the aggregation of the originals. Order is not important, so 1+6 = 6+1. For instance, with 150 observations, I would expect 11,175 feasible combinations (each year).
I sort of got what I want with basic code but, as you will see, it is far too long (I have built 66 different new data sets this way, so it does not really make sense), and I am wondering how to shorten it. I did some trials (plyr, ...) with no real success. Here is what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To illustrate, here is the result I would get for the 1st year, given the sample above. The NAs appear because I only computed results where both values were valid, and only for variables 1 and 3. Also, I used the sum, but it could be any other function:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
As for the first 2 lines in the 3rd year, same type of matrix:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2868
.......... etc ............
I hope I explained myself. Thank you in advance for your hints on how to do this more efficiently.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
  p <- combn(nrow(x), 2)
  data.frame(order=paste(x$order[p[1,]], x$order[p[2,]], sep="_"),
             obs1=x$obs[p[1,]],
             obs2=x$obs[p[2,]],
             year=x$year[1],
             var1=x$var1[p[1,]] + x$var1[p[2,]],
             var2=x$var2[p[1,]] + x$var2[p[2,]],
             var3=x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This lets you be very flexible in how you combine pairs of observations within a year: x[p[1,],] represents the year-specific data for the first element in each pair and x[p[2,],] represents the year-specific data for the second element in each pair. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
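With many variables, typing one line per column gets tedious. One possible generalisation is to add the two row-subsets as whole data frames. This is a sketch only: it rebuilds the sample data so it runs standalone, and vars and pair_sums are illustrative names.

```r
data <- data.frame(
  order = c(3, 4, 5, 33, 34, 35, 78, 79, 80, 93, 94, 95),
  obs   = rep(c(1, 3, 6, 7), each = 3),
  year  = rep(1:3, times = 4),
  var1  = c(32, 33, NA, 10, 10, 10, 5, 5, 5, NA, 4, 3),
  var2  = c(588, 689, 678, 214, 237, 242, 62, 75, 76, NA, 86, 54),
  var3  = c(NA, 2385, 2369, 1274, 1345, 1393, NA, 296, 500, NA, 247, 207)
)

# Columns to aggregate; extend this vector instead of adding lines of code.
vars <- c("var1", "var2", "var3")

pair_sums <- do.call(rbind, lapply(split(data, data$year), function(x) {
  p <- combn(nrow(x), 2)                       # all row pairs within the year
  sums <- x[p[1, ], vars] + x[p[2, ], vars]    # element-wise sum, NA if either is NA
  rownames(sums) <- NULL
  cbind(data.frame(obs1 = x$obs[p[1, ]],
                   obs2 = x$obs[p[2, ]],
                   year = x$year[1]),
        sums)
}))
```

The sum could be replaced by another element-wise operation on the two subsets, for example a mean of the pair via (x[p[1, ], vars] + x[p[2, ], vars]) / 2.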

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables (around 700+ variables) in the data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the original values with the ranks, so I need to create a new rank variable for each of these numeric variables.
From reading the posts, I believe the assign and transform functions along with rank may be able to solve this. I tried implementing this as below (sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit, and ysales, the output dataset needs to contain the variables xcount_rank, xvisit_rank, and ysales_rank holding the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as intervals such as (101, 230], (230, 450], etc., whereas I would like the rank variable to hold 1, 2, etc., up to 10, according to the splits I made. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like each record to show the number of the group (interval) it falls under, rather than the interval label.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
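Note that mutate_each() and funs() are deprecated in current dplyr releases. If that is a concern, both steps can be sketched in base R with lapply. The suffixes _rank and _bin and the name num_cols are choices made for this example only.

```r
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

# Numeric columns to process; with 700+ variables this could be names(input)[-1].
num_cols <- c("xcount", "xvisit", "ysales")

# Plain ranks (1 = smallest value), ties broken by order of appearance.
input[paste(num_cols, "rank", sep = "_")] <-
  lapply(input[num_cols], rank, ties.method = "first")

# Equal-width bins; labels = FALSE returns the bin number instead of "(a,b]".
input[paste(num_cols, "bin", sep = "_")] <-
  lapply(input[num_cols], cut, breaks = 3, labels = FALSE)

input
```

The key detail for the interval problem is labels = FALSE, which makes cut return integer codes rather than a factor of interval labels.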

ddply type functionality on multiple data frames

I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both of these data frames. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both data frames? For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The data frames are measured at different frequencies, which makes combining them difficult. The only solution I've found so far is to ddply the first data frame to extract average sqft/id/month and then pass that value into a second ddply call on the second data frame.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
library(plyr)
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
  left_join(., dfa, by = c("id", "month")) %>%
  group_by(id, month) %>%
  dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
  left_join(dfb, by = c("id", "month")) %>%
  group_by(id, month) %>%
  summarize(
    avgPrice = mean(price),
    avgSqft = mean(sqft)) %>%
  mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
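Because the two data frames are recorded at different frequencies, another option is to summarise dfb down to one row per id/month before joining, so nothing gets repeated. A base R sketch of that idea follows. It rebuilds the sample data to be runnable on its own, and avg_price and out are names invented for the example.

```r
dfa <- data.frame(id = c(1030, 1030, 1027, 1027), sqft = c(16, 15, 1, 2),
                  traf = c(35, 32, 31, 31), month = c(1, 2, 1, 2))
dfb <- data.frame(id = rep(c(1030, 1027), each = 5),
                  price = c(8, 9, 10, 3, 7, 6, 1, 2, 4, 5),
                  frequency = c(196, 101, 156, 137, 190, 188, 198, 123, 185, 122),
                  month = rep(c(1, 1, 1, 2, 2), times = 2),
                  day = rep(c(1, 15, 30, 1, 15), times = 2))

# Step 1: mean price per id/month, so dfb has the same granularity as dfa.
avg_price <- aggregate(price ~ id + month, data = dfb, FUN = mean)
names(avg_price)[names(avg_price) == "price"] <- "avgPrice"

# Step 2: join on the shared keys (merge sorts the result by id, then month).
out <- merge(dfa, avg_price, by = c("id", "month"))

# Step 3: the combined variable.
out$pricePerSqft <- out$avgPrice / out$sqft
out
```

This gives the same four id/month rows as the dplyr summary, without the round() step.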
