I've had this problem before, but I didn't write down the solution, so now I'm in trouble again!
I have a dataframe like the following:
Date Product Qty Income
201001 0001 1000 2000
201002 0001 1500 3000
201003 0001 1200 2400
...
201001 0002 3500 2000
201002 0002 3200 1900
201003 0002 3100 1850
In words, I have one row for each Date/Product combination, together with the Quantity and Income for that combination.
I want to rearrange this dataframe so it looks like the following:
Date Qty.0001 Income.0001 Qty.0002 Income.0002
201001 1000 2000 3500 2000
201002 1500 3000 3200 1900
201003 1200 2400 3100 1850
In words, I want one row for each date and one column for each combination of Product and measure (Qty, Income).
How can I achieve this? Thanks in advance!
Use reshape:
reshape(x, idvar = "Date", timevar = "Product", direction = "wide")
Date Qty.0001 Income.0001 Qty.0002 Income.0002
1 201001 1000 2000 3500 2000
2 201002 1500 3000 3200 1900
3 201003 1200 2400 3100 1850
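If you prefer the tidyverse, tidyr's pivot_wider() gives the same layout. A sketch, assuming your data frame is named x and Product is stored as character so the leading zeros survive (the column order may differ from reshape's):
library(tidyr)
pivot_wider(x, id_cols = Date, names_from = Product,
            values_from = c(Qty, Income), names_sep = ".")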
I have the following dataframe.
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3 VORDEN_PREVENT4 VORDEN_PREVENT5
2484628 1500 1328 2761 3003 2803
2491884 1500 1500 1169 2813 1328
2521158 1500 2813 1328 2761 3003
2548370 1500 1257 2595 1187 1837
2580994 1500 5057 2624 2940 2731
2670164 1500 1874 1218 2791 2892
In this data frame, the VORDEN_PREVENT* columns hold the number of cars sold on successive days; for example, VORDEN_PREVENT1 = 1500 means 1500 cars were sold on day 1. What I want is to return, for each row, the columns whose sales add up to a given total, for example 3000 cars.
For that example, the result should be 1500 from VORDEN_PREVENT1, 1328 from VORDEN_PREVENT2, and 172 from VORDEN_PREVENT3, where 172 is the part of VORDEN_PREVENT3's 2761 needed to bring the running total (1500 + 1328 = 2828) up to 3000.
I don't know how to identify the right rows and columns and compute that remainder correctly.
If I understand correctly, the VORDEN_PREVENT* columns denote sales on subsequent days, and the OP asks on which day the cumulative sum of sales exceeds a given threshold. In addition, the OP wants to see the sales figures which sum up to the threshold.
I suggest solving this type of question in long format, where columns can be treated as data.
1. melt() / dcast()
library(data.table)
threshold <- 3000L
long <- melt(setDT(DT), id.vars = "SEC")
long[, value := c(value[1L], diff(pmin(cumsum(value), threshold))), by = SEC]
dcast(long[value > 0], SEC ~ variable)
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3
1: 2484628 1500 1328 172
2: 2491884 1500 1500 NA
3: 2521158 1500 1500 NA
4: 2548370 1500 1257 243
5: 2580994 1500 1500 NA
6: 2670164 1500 1500 NA
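To see what the value transformation does, here is the first row's sales worked through as a plain vector (a minimal sketch):
x <- c(1500, 1328, 2761, 3003, 2803)
pmin(cumsum(x), 3000L)
# [1] 1500 2828 3000 3000 3000
c(x[1L], diff(pmin(cumsum(x), 3000L)))
# [1] 1500 1328  172    0    0
Capping the cumulative sum at the threshold and then differencing recovers each day's contribution before the cap is reached; filtering on value > 0 keeps only the contributing days.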
2. gather() / spread()
library(tidyr)
library(dplyr)
threshold <- 3000L
DT %>%
  gather(key, value, -SEC) %>%
  group_by(SEC) %>%
  mutate(value = c(value[1L], diff(pmin(cumsum(value), threshold)))) %>%
  filter(value > 0) %>%
  spread(key, value)
# A tibble: 6 x 4
# Groups: SEC [6]
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3
<int> <int> <int> <int>
1 2484628 1500 1328 172
2 2491884 1500 1500 NA
3 2521158 1500 1500 NA
4 2548370 1500 1257 243
5 2580994 1500 1500 NA
6 2670164 1500 1500 NA
3. apply()
With base R:
DT[, -1] <- t(apply(DT[, -1], 1, function(x) c(x[1L], diff(pmin(cumsum(x), threshold)))))
DT
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3 VORDEN_PREVENT4 VORDEN_PREVENT5
1 2484628 1500 1328 172 0 0
2 2491884 1500 1500 0 0 0
3 2521158 1500 1500 0 0 0
4 2548370 1500 1257 243 0 0
5 2580994 1500 1500 0 0 0
6 2670164 1500 1500 0 0 0
Data
library(data.table)
DT <- fread("
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3 VORDEN_PREVENT4 VORDEN_PREVENT5
2484628 1500 1328 2761 3003 2803
2491884 1500 1500 1169 2813 1328
2521158 1500 2813 1328 2761 3003
2548370 1500 1257 2595 1187 1837
2580994 1500 5057 2624 2940 2731
2670164 1500 1874 1218 2791 2892",
data.table = FALSE)
Your question is not very clear to me, so I've reduced it to what I understand: you want to create a column, then filter rows. Using dplyr this can be done quite easily, but first let's recreate some data.
# recreate some data
library(dplyr)
df <- data.frame(time = 1:3,
                 sales1 = c(1234, 1567, 2045),
                 sales2 = c(865, 756, 890))
# first create a diff column
df <- df %>% mutate(sales_diff = sales1 - sales2)
df
df
  time sales1 sales2 sales_diff
1    1   1234    865        369
2    2   1567    756        811
3    3   2045    890       1155
# then you can access the rows you're interested in by filtering them
df %>% filter(sales1==1567)
  time sales1 sales2 sales_diff
1    2   1567    756        811
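You can also filter on the computed column itself, e.g. (a sketch) to keep only rows where the difference exceeds 800:
df %>% filter(sales_diff > 800)
  time sales1 sales2 sales_diff
1    2   1567    756        811
2    3   2045    890       1155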
You can just replace the object/column names with your own data.
Is that what you were looking for?
I have a list time_info_summary
> time_info_summary
$mon_from_time
0700 0800
14 388
$mon_to_time
1800 1830 2000 2100 2200 2300
1 1 60 121 214 5
$tue_from_time
0700 0800
14 388
$tue_to_time
1800 1830 2000 2100 2200 2300
1 1 60 121 214 5
It is a list of tables.
> typeof(time_info_summary)
[1] "list"
It contains times and counts (for "mon_from_time", the time 0700 has count 14 and 0800 has count 388). I want to calculate the weighted average time for each item of the list, e.g. for "mon_from_time" it is (0700*14 + 0800*388)/402 ≈ 0796, and so on. How can I do this?
This should work, even though it's not very elegant. Since time_info_summary is already a list of tables, you can lapply() over it directly:
> lapply(time_info_summary, function(x) weighted.mean(as.numeric(names(x)), x)/100)
$mon_from_time
[1] 7.965174
$mon_to_time
[1] 10.85348
$tue_from_time
[1] 7.965174
$tue_to_time
[1] 10.85348
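One caveat: dividing an HHMM value by 100 treats the minutes as a decimal fraction (1830 becomes 18.30, not 18.5). If you need true clock-time averages, a sketch that converts first (hhmm_to_hours is a hypothetical helper, not part of your data):
# hypothetical helper: convert "HHMM" strings to fractional hours
hhmm_to_hours <- function(t) as.numeric(t) %/% 100 + (as.numeric(t) %% 100) / 60
lapply(time_info_summary, function(x) weighted.mean(hhmm_to_hours(names(x)), x))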
I'm having a beginner's issue: I want to aggregate data by category, creating a new column that holds each category's total repeated for every observation.
I'd like the following data:
PIN Balance
221 5000
221 2000
221 1000
554 4000
554 4500
643 6000
643 4000
To look like:
PIN Balance Total
221 5000 8000
221 2000 8000
221 1000 8000
554 4000 8500
554 4500 8500
643 6000 10000
643 4000 10000
I've tried using aggregate: output <- aggregate(df$Balance ~ df$PIN, data = df, sum), but I haven't been able to get the result back into my original dataset because the number of observations was off.
You can use dplyr to do what you want: first group_by PIN, then create a new column Total with mutate as the sum of the grouped Balance values:
library(dplyr)
res <- df %>% group_by(PIN) %>% mutate(Total=sum(Balance))
Using your data as a data frame df:
df <- structure(list(PIN = c(221L, 221L, 221L, 554L, 554L, 643L, 643L
), Balance = c(5000L, 2000L, 1000L, 4000L, 4500L, 6000L, 4000L
)), .Names = c("PIN", "Balance"), class = "data.frame", row.names = c(NA,
-7L))
## PIN Balance
##1 221 5000
##2 221 2000
##3 221 1000
##4 554 4000
##5 554 4500
##6 643 6000
##7 643 4000
We get the expected result:
print(res)
##Source: local data frame [7 x 3]
##Groups: PIN [3]
##
## PIN Balance Total
## <int> <int> <int>
##1 221 5000 8000
##2 221 2000 8000
##3 221 1000 8000
##4 554 4000 8500
##5 554 4500 8500
##6 643 6000 10000
##7 643 4000 10000
Or we can use data.table:
library(data.table)
setDT(df)[, Total := sum(Balance), by = PIN][]
## PIN Balance Total
##1: 221 5000 8000
##2: 221 2000 8000
##3: 221 1000 8000
##4: 554 4000 8500
##5: 554 4500 8500
##6: 643 6000 10000
##7: 643 4000 10000
Consider a base R solution with a sapply() conditional sum approach:
df <- read.table(text="PIN Balance
221 5000
221 2000
221 1000
554 4000
554 4500
643 6000
643 4000", header=TRUE)
df$Total <- sapply(seq(nrow(df)), function(i) {
  sum(df$Balance[df$PIN == df$PIN[i]])
})
# PIN Balance Total
# 1 221 5000 8000
# 2 221 2000 8000
# 3 221 1000 8000
# 4 554 4000 8500
# 5 554 4500 8500
# 6 643 6000 10000
# 7 643 4000 10000
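As a side note, base R's ave() expresses the same grouped sum more directly (a one-liner, assuming the same df):
df$Total <- ave(df$Balance, df$PIN, FUN = sum)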
Records:
UniqueID Country Price
AAPL USA 107
AAPL USA 105
GOOG USA 555
GOOG USA 555
VW DEU 320
Mapping:
UniqueID Country Price
AAPL USA 120
GOOG USA 550
VW DEU 300
I want to add a column Final and map the values from the Mapping table onto the Records table. For example, all the AAPL entries in the Records table should get a Final value of 120.
Expected output:
UniqueID Country Price Final
AAPL USA 107 120
AAPL USA 105 120
GOOG USA 555 550
GOOG USA 555 550
VW DEU 320 300
I used the following line of code:
Records$Final <- Mapping[which(Records$UniqueID==Mapping$UniqueID),"Price"]
It throws an error saying the replacement and data lengths differ. Also, using merge duplicates the Price column, which I don't want.
We can use inner_join:
library(dplyr)
inner_join(Records, Mapping, by = c('UniqueID', 'Country'))
# UniqueID Country Price.x Price.y
#1 AAPL USA 107 120
#2 AAPL USA 105 120
#3 GOOG USA 555 550
#4 GOOG USA 555 550
#5 VW DEU 320 300
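If you want the mapped column to be called Final rather than Price.y, one option (a sketch) is to rename before joining:
inner_join(Records, rename(Mapping, Final = Price), by = c('UniqueID', 'Country'))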
To follow your original indexing approach, use match() instead of which():
Records$Final <- Mapping$Price[match(Records$UniqueID, Mapping$UniqueID)]
Records
# UniqueID Country Price Final
#1 AAPL USA 107 120
#2 AAPL USA 105 120
#3 GOOG USA 555 550
#4 GOOG USA 555 550
#5 VW DEU 320 300
First, rename the column Price to Final in the Mapping table:
colnames(Mapping)[colnames(Mapping) == "Price"] <- "Final"
Then use merge(), and you should get what you want:
Records <- data.frame(UniqueID = c("AAPL", "AAPL", "GOOG", "GOOG", "VW"),
                      Country = c("USA", "USA", "USA", "USA", "DEU"),
                      Price = c(107, 105, 555, 555, 320))
Mapping <- data.frame(UniqueID = c("AAPL", "GOOG", "VW"),
                      Country = c("USA", "USA", "DEU"),
                      Price = c(120, 550, 300))
names(Mapping)[3] <- "Final"
Output <- merge(x = Records, y = Mapping[, c(1, 3)], by = "UniqueID", all.x = TRUE)
I am new to R. I have two data frames:
PriceData
Date AAPL MSFT GOOG
12/3/2014 100 45 522
12/2/2014 99 45 517
12/1/2014 97 45 511
11/28/2014 97 44 508
QuantityData
Symbol Position
MSFT 1000
AAPL 1200
GOOG 1300
Now I want to calculate the market value, so the output should look like this:
Date AAPL MSFT GOOG
12/3/2014 120000 45000 678600
12/2/2014 118800 45000 672100
12/1/2014 116400 45000 664300
11/28/2014 116400 44000 660400
You can try
# positions of each price column's symbol within QuantityData
indx <- match(colnames(PriceData)[-1], QuantityData$Symbol)
# reorder the price columns to QuantityData's symbol order and
# multiply each column by the matching Position
PriceData[,-1][,indx] <- PriceData[,-1][,indx] *
  QuantityData[,2][col(PriceData[,-1])]
PriceData
# Date AAPL MSFT GOOG
#1 12/3/2014 120000 45000 678600
#2 12/2/2014 118800 45000 672100
#3 12/1/2014 116400 45000 664300
#4 11/28/2014 116400 44000 660400
Or
PriceData[,-1][,indx] <- t(t(PriceData[,-1][,indx])*QuantityData[,2])
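If you find the nested indexing hard to read, sweep() makes the column-wise multiplication explicit (a sketch reusing the same match() idea):
qty <- QuantityData$Position[match(colnames(PriceData)[-1], QuantityData$Symbol)]
PriceData[,-1] <- sweep(PriceData[,-1], 2, qty, `*`)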