I am trying to work out how to look up time data by university name (first row: A, ..., F), field name (first column: Acute, ..., Fo) and/or graduation time (time) in the following file DS.csv.
I am considering a dplyr approach, but I could not extend the numerical ID lookup (from the answer to the thread How to overload function parameters in R?) to a lookup by three variables.
Challenges
How to look up by the first row? Maybe something similar to $1 == "A".
How to expand the university lookup to two columns? In the pseudocode, $1 == "A" refers to the second and third columns, ..., and $1 == "F" to the last two columns.
How to look up by three criteria: the first row (no header), the first column with the header Field, and the column with the header time. Pseudocode:
times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")
Data: DS.csv holds the data. The first row denotes the experiment (university). The data below is in crosstab format:
,A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3
and in straight table format:
Field,time,T,Experiment
Acut,0,0,A
An,9,120,A
En,15.6,2,A
Fo,9.2,2,A
Acute,8.3,1,B
An,7.7,26,B
En,12.9,1,B
Fo,0,0,B
Acute,7.5,1,C
An,7.9,43,C
En,0,0,C
Fo,5.4,1,C
Acute,8.6,2,D
An,7.8,77,D
En,0,0,D
Fo,0,0,D
Acute,0,0,E
An,7.9,60,E
En,14.3,1,E
Fo,0,0,E
Acute,8.3,4,F
An,8.2,326,F
En,14.6,4,F
Fo,7.9,3,F
Pseudocode
library('dplyr')
ow <- options("warn")
DF <- read.csv("/home/masi/CSV/DS.csv", header = T)
# Lookup by the first row (university), by Field, and by time?
times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")
Expected output: 9
Expected output generalised: a, b, c, ...
## Data where values marked by small letters a,b,c, ... are wanted
# uni1 uni2 ...
# time T time T ...
#Field1 a c
#Field2 b ...
#... ...
R: 3.3.3 (2017-03-06)
OS: Debian 8.7
Hardware: Asus Zenbook UX303UA
Taking your initial raw data as the starting point:
# read the data & skip 1st & 2nd line which contain only header information
DF <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, skip=2)
# read the first two lines which contain the header information
headers <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, nrow=2)
# extract the university names for the 'headers' data.frame
universities <- unlist(headers[1,])
universities <- universities[universities != '']
# create column names from the 'headers' data.frame
vec <- headers[2,][headers[2,] == 'T']
headers[2,][headers[2,] == 'T'] <- paste0(vec, seq_along(vec))
names(DF) <- paste0(headers[2,],headers[1,])
Your dataframe now looks as follows:
> DF
Field timeA T1 timeB T2 timeC T3 timeD T4 timeE T5 timeF T6
1: Acute 0.0 0 8.3 1 7.5 1 8.6 2 0.0 0 8.3 4
2: Ane 9.0 120 7.7 26 7.9 43 7.8 77 7.9 60 8.2 326
3: En 15.6 2 12.9 1 0.0 0 0.0 0 14.3 1 14.6 4
4: Fo 9.2 2 0.0 0 5.4 1 0.0 0 0.0 0 7.9 3
It is better to transform your data into long format:
library(data.table)
DT <- melt(setDT(DF), id = 1,
           measure.vars = patterns('^time', '^T'),
           variable.name = 'university',
           value.name = c('time', 't')
           )[, university := universities[university]][]
Now your data looks like:
> DT
Field university time t
1: Acute A 0.0 0
2: Ane A 9.0 120
3: En A 15.6 2
4: Fo A 9.2 2
5: Acute B 8.3 1
6: Ane B 7.7 26
7: En B 12.9 1
8: Fo B 0.0 0
9: Acute C 7.5 1
10: Ane C 7.9 43
11: En C 0.0 0
12: Fo C 5.4 1
13: Acute D 8.6 2
14: Ane D 7.8 77
15: En D 0.0 0
16: Fo D 0.0 0
17: Acute E 0.0 0
18: Ane E 7.9 60
19: En E 14.3 1
20: Fo E 0.0 0
21: Acute F 8.3 4
22: Ane F 8.2 326
23: En F 14.6 4
24: Fo F 7.9 3
Now you can select the required info:
DT[university == 'A' & Field == 'Ane']
which gives:
Field university time t
1: Ane A 9 120
Several dplyr examples to filter the data:
library(dplyr)
DT %>%
  filter(Field == "En" & t > 1)
gives:
Field university time t
1 En A 15.6 2
2 En F 14.6 4
Or:
DT %>%
  arrange(desc(time)) %>%
  filter(time < 14 & t > 3)
gives:
Field university time t
1 Ane A 9.0 120
2 Acute F 8.3 4
3 Ane F 8.2 326
4 Ane C 7.9 43
5 Ane E 7.9 60
6 Ane D 7.8 77
7 Ane B 7.7 26
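To mirror the getTimes() pseudocode from the question, the filter can be wrapped in a small helper; a minimal sketch (the function name and arguments are illustrative, not an existing API), assuming DT is the long-format table built above:
library(dplyr)
getTimes <- function(data, uni, field) {
  data %>%
    filter(university == uni, Field == field) %>%
    arrange(desc(time))
}
getTimes(DT, "A", "Ane")$time
# [1] 9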
Change your crosstab
,A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3
into a straight data format
Field,time,T,Experiment
Acut,0,0,A
An,9,120,A
En,15.6,2,A
Fo,9.2,2,A
Acute,8.3,1,B
An,7.7,26,B
En,12.9,1,B
Fo,0,0,B
Acute,7.5,1,C
An,7.9,43,C
En,0,0,C
Fo,5.4,1,C
Acute,8.6,2,D
An,7.8,77,D
En,0,0,D
Fo,0,0,D
Acute,0,0,E
An,7.9,60,E
En,14.3,1,E
Fo,0,0,E
Acute,8.3,4,F
An,8.2,326,F
En,14.6,4,F
Fo,7.9,3,F
where I used a Vim CSV plugin and visual-block mode.
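For reference, the same reshaping can also be done programmatically in R instead of by hand; a sketch (essentially the same melt idea as in the answer above, with an illustrative output file name), assuming the crosstab is in /home/masi/CSV/DS.csv:
library(data.table)
# read the crosstab, skipping the two header rows
wide <- fread("/home/masi/CSV/DS.csv", skip = 2, header = FALSE)
setnames(wide, c("Field", paste0(c("time.", "T."), rep(LETTERS[1:6], each = 2))))
# melt the paired time/T columns into one row per Field and Experiment
straight <- melt(wide, id.vars = "Field",
                 measure.vars = patterns("^time\\.", "^T\\."),
                 variable.name = "Experiment",
                 value.name = c("time", "T"))
straight[, Experiment := LETTERS[1:6][Experiment]]
setcolorder(straight, c("Field", "time", "T", "Experiment"))
# optionally write it out for the lookups below
fwrite(straight, "~/DS_straight_table.csv")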
Multiple ways to do the selection
This is easy to do in multiple ways after tidying the data into a straight table (not a crosstab); I would prefer SQL. I demonstrate the sqldf package below; it is inefficient with large data, but this data set is small, so it will work.
Also, instead of the relatively slow built-in functions such as read.csv, I would prefer the much faster fread from the data.table package for reading files.
sqldf
> library(data.table)
> library(sqldf)
> a <- fread("~/DS_straight_table.csv")
> sqldf("select time from a where Experiment = 'A' and Field = 'An'")
time
1 9
Another way, without sqldf
> library(data.table)
> a <- fread("~/DS_straight_table.csv")
> a[Experiment == 'A' & Field == 'An']
Field time T Experiment
1: An 9 120 A
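If you also want the rows ordered by descending time, as in the by = "desc(time)" part of the original pseudocode, a small data.table sketch:
> a[Experiment == 'A'][order(-time)]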
Using the "Tall" (straight table) format and library dplyr. Your data only has one value per Field, Experiment.
library(dplyr)
## this is the more general result
df %>%
  group_by(Field, Experiment) %>%
  top_n(1, wt = -time)
## example function
getTimes <- function(data, field, experiment) {
  data %>%
    filter(Field == field, Experiment == experiment) %>%
    top_n(1, wt = -time)
}
getTimes(df, 'An', 'A')
# Field time T Experiment
# 1 An 9 120 A
Related
I have a dataset df with multiple variables and unique IDs
ID A B C D
1 20 5 5.4 120.5
1 30 10 6.8 110.6
2 50 40 7.5 117.8
3 10 50 3.4 119
3 80 30 2.8 117.5
2 5 20 9.5 325.4
I can subset them with the code below
new.df <- df[df$A < 56 & is.na(df$A) == FALSE,]
and now I want to apply a condition per column and subset the data frame by ID.
I want a data frame with one conditional value per column for each ID, such as
ID =1 A=20 B=10 C=5.4 D=110.6
ID =2 A=5 B=40 C=9.5 D=325.4
ID =3 A=10 B=30 C=3.4 D=119
and output data frame should be
ID A B C D
1 20 10 5.4 110.6
2 5 40 9.5 325.4
3 10 30 3.4 119
Can you help me out with how this can be done?
This will give the output as a data frame:
df %>% group_by(ID) %>% summarise(minA=min(A), maxB = max(B), minC= min(C), minD = min(D))
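For a self-contained check, here is a sketch that rebuilds the posted data and runs that call. Note that only the ID 1 row matches the expected output exactly; the other expected rows mix min and max per column, so swap min()/max() per column to match whichever rule is actually intended.
library(dplyr)
# data copied from the question
df <- data.frame(ID = c(1, 1, 2, 3, 3, 2),
                 A  = c(20, 30, 50, 10, 80, 5),
                 B  = c(5, 10, 40, 50, 30, 20),
                 C  = c(5.4, 6.8, 7.5, 3.4, 2.8, 9.5),
                 D  = c(120.5, 110.6, 117.8, 119, 117.5, 325.4))
df %>% group_by(ID) %>% summarise(minA = min(A), maxB = max(B), minC = min(C), minD = min(D))
#   ID minA maxB minC  minD
# 1  1   20   10  5.4 110.6
# 2  2    5   40  7.5 117.8
# 3  3   10   50  2.8 117.5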
I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the ‘cumsum’ command to create the Subgroups, using the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- "0"
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
  current.sum <- current.sum + dat[c, "Area"]
  dat[c, "Diff_CumArea"] <- current.sum
  if (current.sum >= 20) {
    dat[c, "Sample_Int_1"] <- "1"
    dat[c, "Sample_End"] <- "End"
    current.sum <- 0
    # Sample_Int_1 is stored as character above, so coerce before cumsum
    dat$Sample_Int_2 <- cumsum(as.numeric(dat$Sample_Int_1)) + 1
    dat$Sample_Final <- dat$Sample_Int_2
    for (d in 1:nrow(dat)) {
      if (dat$Sample_End[d] == 'End')
        dat$Subgroup[d] <- dat$Sample_Int_2[d] - 1
      else 0
    }
  }
}
write.csv(dat, file = 'Pooling_Test_Output.csv', row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have problems including a command for choosing samples randomly from each Group, so I currently randomise the order of samples before loading the data frame into R. Secondly, in the output table the Subgroups are numbered consecutively, but I would like to start the Subgroup numbering with 1 for each new Group. Has anybody any advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") #If you haven't used dplyr before
library(dplyr)
dat %>%
  group_by(Group) %>%
  slice(sample(1:n())) %>%
  mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))
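As a quick check on made-up data (column names follow the question; the values are purely illustrative), the grouped cumsum means the SubGroup numbering restarts at 1 for every Group, which addresses the second point:
library(dplyr)
set.seed(1)
dat <- data.frame(SampleID = 1:40,
                  Group    = rep(c("A", "B"), each = 20),
                  Area     = round(runif(40, 1, 6), 1))
pooled <- dat %>%
  group_by(Group) %>%
  slice(sample(1:n())) %>%
  mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20)) %>%
  ungroup()
# both groups start again at SubGroup 1
pooled %>% group_by(Group) %>% summarise(first_subgroup = min(SubGroup), last_subgroup = max(SubGroup))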
I have a dataframe like the one below:
library(data.table)
x <- data.table(Tickers = c("A","A","A","B","B","B","B","D","D","D","D"),
                Type    = c("put","call","put","call","call","put","call","put","call","put","call"),
                Strike  = c(35, 37.5, 37.5, 10, 11, 11, 12, 40, 40, 42, 42),
                Other   = sample(20, 11))
Tickers Type Strike Other
1: A put 35.0 6
2: A call 37.5 5
3: A put 37.5 13
4: B call 10.0 15
5: B call 11.0 12
6: B put 11.0 4
7: B call 12.0 20
8: D put 40.0 7
9: D call 40.0 11
10: D put 42.0 10
11: D call 42.0 1
I am trying to analyze a subset of the data. The subset I would like to take is data where the ticker and strike are the same. But I also only want to grab this data if both a put and a call exist under type. With the data above, for example, I would like to return the following result:
x[c(2,3,5,6,8:11),]
Tickers Type Strike Other
1: A call 37.5 5
2: A put 37.5 13
3: B call 11.0 12
4: B put 11.0 4
5: D put 40.0 7
6: D call 40.0 11
7: D put 42.0 10
8: D call 42.0 1
I'm not sure of the best way to go about doing this. My thought process is that I should create another column vector like
x$id <- paste(x$Tickers,x$Strike,sep="_")
Then use this vector to only pull values where there are multiple ids.
x[x$id %in% x$id[duplicated(x$id)],]
Tickers Type Strike Other id
1: A call 37.5 5 A_37.5
2: A put 37.5 13 A_37.5
3: B call 11.0 12 B_11
4: B put 11.0 4 B_11
5: D put 40.0 7 D_40
6: D call 40.0 11 D_40
7: D put 42.0 10 D_42
8: D call 42.0 1 D_42
I'm not sure how efficient this is, as my actual data consists of a lot more rows.
Also, this solution does not check for the type condition of there being one put and one call.
Also, the wording of the title could be a lot better; I apologize.
EDIT: Having checked out the post Finding ALL duplicate rows, including "elements with smaller subscripts",
I could also use this solution:
x$id <- paste(x$Tickers, x$Strike, sep = "_")
x[duplicated(x$id) | duplicated(x$id, fromLast = TRUE), ]
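One way to also cover the put/call condition while keeping the id column created above (a base-R style sketch; the data.table approaches below should be faster on large data):
# TRUE where an id has both a put and a call under Type
keep <- as.logical(ave(as.character(x$Type), x$id,
                       FUN = function(tp) all(c("put", "call") %in% tp)))
x[keep, ]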
You could try something like:
x[, select := (.N >= 2 & all(c("put", "call") %in% unique(Type))), by = .(Tickers, Strike)][which(select)]
# Tickers Type Strike Other select
#1: A call 37.5 17 TRUE
#2: A put 37.5 16 TRUE
#3: B call 11.0 11 TRUE
#4: B put 11.0 20 TRUE
#5: D put 40.0 1 TRUE
#6: D call 40.0 12 TRUE
#7: D put 42.0 6 TRUE
#8: D call 42.0 2 TRUE
Another idea might be a merge:
x[x, on = .(Tickers, Strike),
  select := (length(Type) >= 2 & all(c("put", "call") %in% Type)),
  by = .EACHI][which(select)]
I'm not entirely sure how to get around the group-by operations, since you want to make sure that each group has both "call" and "put". I was thinking about using keys, but haven't been able to incorporate the "call"/"put" aspect.
An edit to your data to give a case where both put and call does not exist (I changed the very last "call" to "put"):
x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
                Type=c("put","call","put","call","call","put","call","put","call","put","put"),
                Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other=sample(20,11))
Since you are using data.table, you can use the built-in counter .N along with by variables to count groups and subset with that. If counting Type lets you reliably determine that both put and call are present, this could work:
x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]
The part enclosed in the first set of [] does the counting, and then the [n > 1 & types == 2] does the subsetting.
I am not a user of the data.table package, so this code is base R only.
agg <- aggregate(Type ~ Tickers + Strike, data = x, length)
result <- merge(x, subset(agg, Type > 1)[1:2], by = c("Tickers", "Strike"))[, c(1, 3, 2, 4)]
result
# Tickers Type Strike Other
#1: A call 37.5 17
#2: A put 37.5 7
#3: B call 11.0 14
#4: B put 11.0 20
#5: D put 40.0 15
#6: D call 40.0 2
#7: D put 42.0 8
#8: D call 42.0 1
rm(agg) # final clean up
I have a data.frame with one column containing categorical data, one column containing dates, and one column containing numeric values. For simplicity, see the sample below:
A B C
1 L 2015-12-01 5.7
2 M 2015-11-30 2.1
3 K 2015-11-01 3.2
4 L 2015-10-05 5.7
5 M 2015-12-05 1.2
6 L 2015-11-15 2.3
7 L 2015-12-03 4.4
I would like to, for each category in A, compute a lagging average (e.g. average of the previous 30 days' values in column C).
I cannot for the life of me figure this one out. I have tried using sapply with a custom function that subsets the data.frame (or a deep copy of it) on category and date and returns the statistic (think mean or sd). That works fine for single values, but inside sapply it returns all NAs.
Any help you can give is appreciated.
This could be done more compactly, but here I have drawn it out to make it easiest to understand. The core is the split, lapply/apply, and then putting it back together. It uses a date window rather than a solution based on sorting, so it is very general. I also put the object back to its original order to enable direct comparison.
# set up the data
set.seed(100)
# create a data.frame with about a two-month period for each category of A
df <- data.frame(A = rep(c("K", "L", "M"), each = 60),
                 B = rep(seq(as.Date("2015-01-01"), as.Date("2015-03-01"), by = "days"), 3),
                 C = round(runif(180) * 6, 1))
head(df)
## A B C
## 1 K 2015-01-01 1.8
## 2 K 2015-01-02 1.5
## 3 K 2015-01-03 3.3
## 4 K 2015-01-04 0.3
## 5 K 2015-01-05 2.8
## 6 K 2015-01-06 2.9
tail(df)
## A B C
## 175 M 2015-02-24 4.8
## 176 M 2015-02-25 2.0
## 177 M 2015-02-26 5.7
## 178 M 2015-02-27 3.9
## 179 M 2015-02-28 2.8
## 180 M 2015-03-01 3.6
# preserve original order
df$originalOrder <- 1:nrow(df)
# randomly shuffle the order
randomizedOrder <- order(runif(nrow(df)))
df <- df[randomizedOrder, ]
# split on A - your own data might need coercion of A to a factor
df.split <- split(df, df$A)
# set the window size
window <- 30
# compute the moving average
listD <- lapply(df.split, function(tmp) {
  apply(tmp, 1, function(x) mean(tmp$C[tmp$B <= as.Date(x["B"]) & tmp$B > (as.Date(x["B"]) - window)]))
})
# combine the result with the original data
result <- cbind(do.call(rbind, df.split), rollingMean = unlist(listD))
# and tidy up:
# return to original order
result <- result[order(result$originalOrder), ]
result$originalOrder <- NULL
# remove the row names
row.names(result) <- NULL
result[c(1:5, 59:65), ]
## A B C rollingMean
## 1 K 2015-01-01 1.8 1.800000
## 2 K 2015-01-02 1.5 1.650000
## 3 K 2015-01-03 3.3 2.200000
## 4 K 2015-01-04 0.3 1.725000
## 5 K 2015-01-05 2.8 1.940000
## 59 K 2015-02-28 3.6 3.080000
## 60 K 2015-03-01 1.3 3.066667
## 61 L 2015-01-01 2.8 2.800000
## 62 L 2015-01-02 3.9 3.350000
## 63 L 2015-01-03 5.8 4.166667
## 64 L 2015-01-04 4.1 4.150000
## 65 L 2015-01-05 2.7 3.860000
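For comparison, a more compact take on the same 30-day window using dplyr (a sketch, assuming the df constructed at the top of this answer; the per-row window calculation is the same, with the window hard-coded as 30 days):
library(dplyr)
result2 <- df %>%
  group_by(A) %>%
  mutate(rollingMean = sapply(B, function(d) mean(C[B <= d & B > d - 30]))) %>%
  ungroup()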
I am trying to rearrange a data frame of mine a bit in R. I have a data frame (called df) like the one below, and want to move all the values in column val_1 and val_2 from row 4 and below, one row down.
hour day val_1 val_2
1 0 31 18.3 3.2
2 1 31 16.5 3.6
3 2 31 15.7 2.7
4 3 31 16.7 2.9
5 4 31 18.0 2.1
6 5 31 18.1 1.9
The gap at row 4 in columns val_1 and val_2 should be filled with NA, and the missing values in the new row 7 under hour and day should be 6 (the value above increased by 1) and 31, as shown here.
hour day val_1 val_2
1 0 31 18.3 3.2
2 1 31 16.5 3.6
3 2 31 15.7 2.7
4 3 31 NA NA
5 4 31 16.7 2.9
6 5 31 18.0 2.1
7 6 31 18.1 1.9
Hope you can help me out, so I can automate my little task here.
UPDATE AND SOLUTION:
Thanks to Nico I worked out a solution to my problem. My solution is a bit different from the one suggested, since it moves down both val_1 and val_2 at the same time, instead of one at a time. The code looks like this:
# Get row number where the new row should be placed after
row.no <- nrow(subset(df, hour <= 2))
# Create a NA row with matching column names (column val_1 and val_2 in this case)
new.row <- df[1, 3:ncol(df)]
new.row[] <- NA
# Create new value, hour and day section
values <- rbind(df[1:row.no, 3:ncol(df)],
                new.row,
                df[-1:-row.no, 3:ncol(df)])
hour <- c(df$hour, df$hour[nrow(df)] + 1)
day <- c(df$day, df$day[nrow(df)])
# Combine everything again
df.new <- cbind(hour, day, values)
First, let's handle val_1 and val_2
# Note the use of negative indices to "deselect" values
val_1 <- c(df$val_1[1:3], NA, df$val_1[-1:-3])
val_2 <- c(df$val_2[1:3], NA, df$val_2[-1:-3])
Now, let's add the new hour and day
hour <- c(df$hour, df$hour[length(df$hour)]+1)
# Unclear from the question where the value for day should
# be taken from. I am assuming the last one
day <- c(df$day, df$day[length(df$hour)])
And put everything back together
new.df <- data.frame(hour, day, val_1, val_2)
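For a quick self-contained check, the posted example data can be rebuilt directly (values copied from the question; named df as in the snippets above, while the later answers call it mydf or dat):
df <- data.frame(hour = 0:5,
                 day = 31,
                 val_1 = c(18.3, 16.5, 15.7, 16.7, 18.0, 18.1),
                 val_2 = c(3.2, 3.6, 2.7, 2.9, 2.1, 1.9))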
A possible alternative is to combine expand.grid and merge, something like the following:
First, increment all the values of "hour" after row 3 by 1.
mydf$hour[-c(1:3)] <- mydf$hour[-c(1:3)]+1
Second, since our "hour" column now has a gap, we'll need to fill it in. We can do that with seq. We'll also use expand.grid to create all combinations of the values in "hour" and "day", giving us something to merge with.
merge(mydf, expand.grid(hour = seq(0, max(mydf$hour)),
                        day = unique(mydf$day)), all = TRUE)
# hour day val_1 val_2
# 1 0 31 18.3 3.2
# 2 1 31 16.5 3.6
# 3 2 31 15.7 2.7
# 4 3 31 NA NA
# 5 4 31 16.7 2.9
# 6 5 31 18.0 2.1
# 7 6 31 18.1 1.9
If dat is the dataset
# empty data frame with one extra row and the original column names
datNew <- setNames(as.data.frame(matrix(, ncol = 4, nrow = nrow(dat) + 1)), colnames(dat))
# copy val_1/val_2 into every row except the new row 4 (the logical index from dat recycles, leaving only row 4 empty)
datNew[dat$hour != 3, -(1:2)] <- dat[, -(1:2)]
# rebuild hour (0, 1, 2, ...) and day
datNew[, 1] <- (1:nrow(datNew)) - 1
datNew[, 2] <- dat[1, 2]
datNew
or
# indexing with NA inserts a row of NAs at position 4
dat1 <- dat[c(1:3, NA, 4:nrow(dat)), ]
# fill hour/day from row 4 onward: shift them up one row and add (last hour + 1, same day) at the end
dat1[4:nrow(dat1), 1:2] <- rbind(dat1[5:nrow(dat1), 1:2], dat1[nrow(dat1), 1:2] + c(1, 0))
row.names(dat1) <- 1:nrow(dat1)