Group values with identical ID into columns without summarizing them in R - r

I have a dataframe that looks like this, but with a lot more Proteins
Protein z
Irak4 -2.46
Irak4 -0.13
Itk -0.49
Itk 4.22
Itk -0.51
Ras 1.53
For further operations I need the data to be grouped by protein name into columns, like this:
Irak4 Itk Ras
-2.46 -0.49 1.53
-0.13 4.22 NA
NA -0.51 NA
I tried different packages like dplyr or reshape, but did not manage to transform the data into the desired format.
Is there any way to achieve this? I think the missing datapoints for some Proteins are the main problem here.
I am quite new to R, so my apologies if I am missing an obvious solution.

Here is an option with tidyverse
library(tidyverse)
DF %>%
group_by(Protein) %>%
mutate(idx = row_number()) %>%
spread(Protein, z) %>%
select(-idx)
# A tibble: 3 x 3
# Irak4 Itk Ras
# <dbl> <dbl> <dbl>
#1 -2.46 -0.49 1.53
#2 -0.13 4.22 NA
#3 NA -0.51 NA
Before we spread the data, we need to create unique identifiers.
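To see why the per-protein row index matters, here is a minimal base R sketch of the same idea: the index says which output row each value goes to, and the protein name says which column. The names `idx`, `prots`, and `wide` are illustrative only and do not come from the answer above.

```r
# Same example data as in the question
DF <- data.frame(
  Protein = c("Irak4", "Irak4", "Itk", "Itk", "Itk", "Ras"),
  z = c(-2.46, -0.13, -0.49, 4.22, -0.51, 1.53),
  stringsAsFactors = FALSE
)

# Running count within each protein: 1, 2, 1, 2, 3, 1
idx <- ave(seq_along(DF$Protein), DF$Protein, FUN = seq_along)

# Empty matrix with one column per protein, pre-filled with NA
prots <- unique(DF$Protein)
wide <- matrix(NA_real_, nrow = max(idx), ncol = length(prots),
               dimnames = list(NULL, prots))

# Place each z value at (row index, protein column)
wide[cbind(idx, match(DF$Protein, prots))] <- DF$z
wide <- as.data.frame(wide)
print(wide)
```

Without the index, two values of the same protein would compete for the same cell, which is why spreading the raw data directly fails.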
In base R you could use unstack first which will give you a named list of vectors that contain the values in the z column.
Use lapply to iterate over that list and append the vectors with NAs using the `length<-` function in order to have a list of vectors with equal lengths. Then we can call data.frame.
lst <- unstack(DF, z ~ Protein)
data.frame(lapply(lst, `length<-`, max(lengths(lst))))
# Irak4 Itk Ras
#1 -2.46 -0.49 1.53
#2 -0.13 4.22 NA
#3 NA -0.51 NA
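As a small illustration of the padding step, the replacement function `length<-` can be called like any other function; growing a vector this way fills the new positions with NA. The variable names below are made up for the example.

```r
x <- c(-2.46, -0.13)

# Functional form of `length(x) <- 3`; the extra slot is filled with NA
padded <- `length<-`(x, 3)
print(padded)
```

This is what `lapply(lst, `length<-`, max(lengths(lst)))` does to every element of the list, so all vectors end up with the same length.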
data
DF <- structure(list(Protein = c("Irak4", "Irak4", "Itk", "Itk", "Itk",
"Ras"), z = c(-2.46, -0.13, -0.49, 4.22, -0.51, 1.53)), .Names = c("Protein",
"z"), class = "data.frame", row.names = c(NA, -6L))

library(data.table)
dcast(setDT(df),rowid(Protein)~Protein,value.var='z')
Protein Irak4 Itk Ras
1: 1 -2.46 -0.49 1.53
2: 2 -0.13 4.22 NA
3: 3 NA -0.51 NA
In base R you can do:
data.frame(sapply(a<-unstack(df,z~Protein),`length<-`,max(lengths(a))))
Irak4 Itk Ras
1 -2.46 -0.49 1.53
2 -0.13 4.22 NA
3 NA -0.51 NA
Or using reshape:
reshape(transform(df,gr=ave(z,Protein,FUN=seq_along)),v.names = 'z',timevar = 'Protein',idvar = 'gr',dir='wide')
gr z.Irak4 z.Itk z.Ras
1 1 -2.46 -0.49 1.53
2 2 -0.13 4.22 NA
5 3 NA -0.51 NA

Related

standardize a variable values differently based on another categorical variable in R (Using R Base)

I have a large dataset with a continuous variable "Cholesterol" measured at two visits for each participant (each participant has two rows: first visit = Before and second visit = After). I'd like to standardise cholesterol, but the Before and After visits are merged in one column, so standardising the whole column would be inaccurate, because the calculation uses the mean and the SD.
Using base R, how can I create a new cholesterol variable standardised by Visit in the same data set? Standardisation should be done twice, once for Before and once for After, but the standardised values should end up in a single variable that follows the same structure as this DF.
DF$Cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
DF$Visit< -c(Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before, After,Before,After,Before,After)
# the standardisation function I want to apply
standardise <- function(x) {return((x-min(x,na.rm = T))/sd(x,na.rm = T))}
Thank you in advance.
Let's make your data, fix the df$visit assignment, fix the standardise function to be mean rather than min, and then assume each new occasion of before is the next person, pivot to wide format, then mutate our before and after standardised variables:
library(dplyr)  # %>% and mutate()
library(tidyr)  # pivot_wider()
df <- data.frame(x = rep(1, 30))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
df <- df %>%
mutate(person = cumsum(visit == "before"))%>%
pivot_wider(names_from = visit, id_cols = person, values_from = cholesterol)%>%
mutate(before_std = standardise(before),
after_std = standardise(after))
gives:
person before after before_std after_std
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.986 2.92 -1.16 1.33
2 2 3.93 2.95 1.63 1.36
3 3 4.22 2.48 1.91 0.842
4 4 0.997 0.388 -1.15 -1.49
5 5 1.18 1.41 -0.979 -0.356
6 6 1.05 1.28 -1.10 -0.503
7 7 2.85 0.437 0.609 -1.44
8 8 2.24 0.757 0.0300 -1.08
9 9 3.04 1.73 0.788 0.00940
10 10 1.92 2.45 -0.271 0.814
11 11 2.85 0.843 0.611 -0.985
12 12 1.70 2.40 -0.483 0.749
13 13 2.65 1.22 0.420 -0.567
14 14 2.34 1.90 0.126 0.199
15 15 1.17 2.73 -0.986 1.12
If you actually want min in your standardise function rather than mean, editing it should be simple enough.
Edited for BaseR solution, but with a cautionary tale that there's probably a much neater solution:
df <- data.frame(id = rep(c(seq(1, 15, 1)), each = 2))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
df <- reshape(df, direction = "wide", idvar = "id", timevar = "visit")
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
df$before_std <- round(standardise(df$cholesterol.before), 2)
df$after_std <- round(standardise(df$cholesterol.after), 2)
gives:
id cholesterol.before cholesterol.after before_std after_std
1 1 0.9861551 2.9154158 -1.16 1.33
3 2 3.9302373 2.9453085 1.63 1.36
5 3 4.2248018 2.4789901 1.91 0.84
7 4 0.9972635 0.3879830 -1.15 -1.49
9 5 1.1782336 1.4065341 -0.98 -0.36
11 6 1.0495609 1.2750138 -1.10 -0.50
13 7 2.8515144 0.4369885 0.61 -1.44
15 8 2.2410429 0.7566147 0.03 -1.08
17 9 3.0395565 1.7335131 0.79 0.01
19 10 1.9242212 2.4539439 -0.27 0.81
21 11 2.8528908 0.8432039 0.61 -0.99
23 12 1.7002653 2.3952744 -0.48 0.75
25 13 2.6522959 1.2178764 0.42 -0.57
27 14 2.3426695 1.9030782 0.13 0.20
29 15 1.1708246 2.7267124 -0.99 1.12
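If the standardised values should stay in one column in the original long layout, as the question asks, one possible base R sketch uses `ave()`, which applies a function within each group and returns the results in the original row order. The toy data frame below is invented for illustration, and the mean-based `standardise()` follows the answer above rather than the `min` version in the question.

```r
standardise <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Hypothetical toy data: three participants, two visits each
df <- data.frame(
  visit = rep(c("before", "after"), 3),
  cholesterol = c(1, 4, 2, 6, 3, 8)
)

# Standardise within each visit; the result lines up with the original rows
df$cholesterol_std <- ave(df$cholesterol, df$visit, FUN = standardise)
print(df)
```

Here the "before" values (1, 2, 3) and the "after" values (4, 6, 8) are each centred and scaled with their own mean and SD, and no reshaping is needed.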

populate values from another data frame based on predefined set of columns

I have two data frames. The first one looks like this:
df1 <- data.frame(Hugo_Symbol=c("CDKN2A", "JUN", "IRS2","MTOR",
"NRAS"),
A183=c(-0.19,NA,2.01,0.4,1.23),
A185=c(0.11,2.45,NA,NA,1.67),
A186=c(1.19,NA,2.41,0.78,1.93),
A187=c(2.78,NA,NA,0.7,2.23),
A188=c(NA,NA,NA,2.4,1.23))
head(df1)
Hugo_Symbol A183 A185 A186 A187 A188
1 CDKN2A -0.19 0.11 1.19 2.78 NA
2 JUN NA 2.45 NA NA NA
3 IRS2 2.01 NA 2.41 NA NA
4 MTOR 0.40 NA 0.78 0.70 2.40
5 NRAS 1.23 1.67 1.93 2.23 1.23
The second data frame is smaller and has empty (placeholder) values:
df2 <- data.frame(Hugo_Symbol=c("CDKN2A", "IRS2", "NRAS"),
A183=c(0, 0, 0),
A187=c(0, 0, 0),
A188=c(0, 0, 0))
head(df2)
Hugo_Symbol A183 A187 A188
1 CDKN2A 0 0 0
2 IRS2 0 0 0
3 NRAS 0 0 0
I would like to populate the second data frame with values from the first data frame. The final result should look like this:
Hugo_Symbol A183 A187 A188
1 CDKN2A -0.19 2.78 NA
2 IRS2 2.01 NA NA
3 NRAS 1.23 2.23 1.23
I tried the cbind() and merge() functions, but they do not work on data with different numbers of rows and columns.
I would appreciate any help!
Thank you!
Olha
I don't get the logic of your output, I hope you wrote it wrong, but I think you want the following:
matchedRowInds <- match(df2$Hugo_Symbol,df1$Hugo_Symbol)
matchedColInds <- match(colnames(df2),colnames(df1))
newdf <- df1[matchedRowInds,matchedColInds]
# > newdf
# Hugo_Symbol A183 A187 A188
# 1 CDKN2A -0.19 2.78 NA
# 3 IRS2 2.01 NA NA
# 5 NRAS 1.23 2.23 1.23
Idea: Get the rows of the bigger data frame that are also present in the smaller one. Do the same with the columns.
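A variant of the same matching idea, sketched here under the assumption that `df2` should be filled in place rather than replaced by a new data frame: match the rows by `Hugo_Symbol`, take the value columns the two frames share, and assign. The names `rows` and `cols` are illustrative.

```r
df1 <- data.frame(Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR", "NRAS"),
                  A183 = c(-0.19, NA, 2.01, 0.4, 1.23),
                  A185 = c(0.11, 2.45, NA, NA, 1.67),
                  A186 = c(1.19, NA, 2.41, 0.78, 1.93),
                  A187 = c(2.78, NA, NA, 0.7, 2.23),
                  A188 = c(NA, NA, NA, 2.4, 1.23))
df2 <- data.frame(Hugo_Symbol = c("CDKN2A", "IRS2", "NRAS"),
                  A183 = c(0, 0, 0),
                  A187 = c(0, 0, 0),
                  A188 = c(0, 0, 0))

# Row positions in df1 for each gene in df2 (NA if a gene is missing from df1)
rows <- match(df2$Hugo_Symbol, df1$Hugo_Symbol)

# Value columns present in both frames, excluding the key column
cols <- setdiff(intersect(names(df2), names(df1)), "Hugo_Symbol")

# Overwrite the placeholder zeros with the matched values
df2[cols] <- df1[rows, cols]
print(df2)
```

A gene that appears only in `df2` would get NA in every filled column, since `match()` returns NA for it.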
You can use semi_join from dplyr:
your final table has unexpected values.
my version:
library(dplyr)
df3 <- df1 %>% semi_join(df2, by="Hugo_Symbol") %>%
select(Hugo_Symbol, A183, A187, A188)
Here is a data.table approach. Are you sure the desired output in the question is correct? It seems to me that IRS2 - A188 should be NA and not 2.23.
library( data.table )
#make them both data.tables
setDT(df1); setDT(df2)
#find the common columns
comcols <- intersect( names(df1[,-1]), names(df2[,-1]) )
#create a data.table syntax for an update join on the common columns
expr <- paste0( "df2[ df1, `:=` (",
paste0( comcols, " = i.", comcols, collapse = " ," ),
" ), on = .(Hugo_Symbol) ]" )
eval(parse(text=expr))
df2
# Hugo_Symbol A183 A187 A188
# 1: CDKN2A -0.19 2.78 NA
# 2: IRS2 2.01 NA NA
# 3: NRAS 1.23 2.23 1.23

Extracting the slope for individual observation

I'm a newbie in R. I have a data set with 3 sets of lung function measurements and the 3 corresponding dates for each observation, given below. I would like to extract the slope (decline in lung function) for each observation using R and insert it into a new column.
1. How should I approach the problem?
2. Is my data set arranged in right format?
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 9/12/1991 3/11/1993
18303 0.79 4/23/1992
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991
example for 1st observation, slope=0.0003
Thanks..
If I understood the question, I think you want the slope between each set of visits:
library(dplyr)
group_by(df, ID) %>%
mutate_at(vars(starts_with("DATE")), funs(as.Date(., "%m/%d/%Y"))) %>%
do(data_frame(slope=diff(unlist(.[,2:4]))/diff(unlist(.[,5:7])),
after_visit=1+(1:length(slope))))
## Source: local data frame [18 x 3]
## Groups: ID [9]
##
## ID slope after_visit
## <int> <dbl> <dbl>
## 1 18105 -2.309469e-04 2
## 2 18105 -2.830189e-04 3
## 3 18200 -3.663004e-05 2
## 4 18200 NA 3
## 5 18303 NA 2
## 6 18303 NA 3
## 7 24204 -3.484321e-04 2
## 8 24204 6.745363e-05 3
## 9 28102 -5.639098e-04 2
## 10 28102 -2.359882e-04 3
## 11 34104 2.594810e-04 2
## 12 34104 -2.747253e-05 3
## 13 43108 -4.433498e-04 2
## 14 43108 -1.098901e-04 3
## 15 103114 -3.937008e-04 2
## 16 103114 -4.123711e-04 3
## 17 114101 -3.448276e-04 2
## 18 114101 2.604167e-05 3
Alternate munging:
group_by(df, ID) %>%
mutate_at(vars(starts_with("DATE")), funs(as.Date(., "%m/%d/%Y"))) %>%
do(data_frame(date=as.Date(unlist(.[,5:7]), origin="1970-01-01"), # in the event you wanted to keep the data less awful and have one observation per row, this preserves the Date class
reading=unlist(.[,2:4]))) %>%
do(data_frame(slope=diff(.$reading)/unclass(diff(.$date))))
This is a bit of a "hacky" solution but if I understand your question correctly (some clarification may be needed), this should work in your case. Note, this is somewhat specific to your case since the column pairs are expected to be in the order you specified.
library(dplyr)
library(lubridate)
### Load Data
tdf <- read.table(header=TRUE, stringsAsFactors = FALSE, text = '
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 NA 9/12/1991 3/11/1993 NA
18303 0.79 NA NA 4/23/1992 NA NA
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991') %>% tbl_df
#####################################
### Reshape the data by column pairs.
#####################################
### Function to reshape a single column pair
xform_data <- function(x) {
df<-data.frame(tdf[,'ID'],
names(tdf)[x],
tdf[,names(tdf)[x]],
tdf[,names(tdf)[x+3]], stringsAsFactors = FALSE)
names(df) <- c('ID', 'DateKey', 'Val', 'Date'); df
}
### Create a new data frame with the data in a deep format (i.e. reshaped)
### 'lapply' is used to reshape each pair of columns (date and value).
### 'lapply' returns a list of data frames (on df per pair) and 'bind_rows'
### combines them into one data frame.
newdf <-
bind_rows(lapply(2:4, function(x) {xform_data(x)})) %>%
mutate(Date = mdy(Date, tz='utc'))
#####################################
### Calculate the slopes per ID
#####################################
slopedf <-
newdf %>%
arrange(DateKey, Date) %>%
group_by(ID) %>%
do(slope = lm(Val ~ Date, data = .)$coefficients[[2]]) %>%
mutate(slope = as.vector(slope)) %>%
ungroup
slopedf
## # A tibble: 9 x 2
## ID slope
## <int> <dbl>
## 1 18105 -3.077620e-09
## 2 18200 -4.239588e-10
## 3 18303 NA
## 4 24204 -5.534095e-10
## 5 28102 -4.325210e-09
## 6 34104 1.690414e-09
## 7 43108 -2.490139e-09
## 8 103114 -4.645589e-09
## 9 114101 -1.924497e-09
##########################################
### Adding slope column to original data.
##########################################
tdf %>% left_join(slopedf, by = 'ID')
## # A tibble: 9 x 8
## ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13 slope
## <int> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993 -3.077620e-09
## 2 18200 0.87 0.85 NA 9/12/1991 3/11/1993 <NA> -4.239588e-10
## 3 18303 0.79 NA NA 4/23/1992 <NA> <NA> NA
## 4 24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994 -5.534095e-10
## 5 28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992 -4.325210e-09
## 6 34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994 1.690414e-09
## 7 43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995 -2.490139e-09
## 8 103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996 -4.645589e-09
## 9 114101 0.73 0.59 0.60 6/25/1989 8/5/1990 8/24/1991 -1.924497e-09
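The slopes above are tiny because `mdy(Date, tz='utc')` produces date-times, so `lm()` measures time in seconds. As a hedged alternative, the sketch below fits the regression against elapsed days in base R. The helper name `slope_per_day` is hypothetical, and it assumes the dates are month/day/year as in the question.

```r
# Slope of value against time in days; NA when fewer than two complete points
slope_per_day <- function(values, dates) {
  ok <- !is.na(values) & !is.na(dates)
  if (sum(ok) < 2) return(NA_real_)
  days <- as.numeric(dates[ok] - min(dates[ok]))  # days since first visit
  unname(coef(lm(values[ok] ~ days))[2])          # second coefficient is the slope
}

# First observation from the question (ID 18105)
fev <- c(1.35, 1.25, 1.04)
dates <- as.Date(c("6/9/1990", "8/16/1991", "8/27/1993"), format = "%m/%d/%Y")
s1 <- slope_per_day(fev, dates)
print(s1)

# An observation with a single visit has no slope
s3 <- slope_per_day(c(0.79, NA, NA), as.Date(c("1992-04-23", NA, NA)))
print(s3)
```

Multiplying a per-day slope by 365.25 gives an approximate decline per year, which is often easier to interpret for lung function.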

How to make "lists" or "vectors" of data frames

The short story about what I need: I have read in a CSV file, and I want to take some of the columns and store them into variables as their own data frame, and then store the variables into a list. However, when I use c() to do this, it just puts all the data in a flat vector. Is there a way to have a list of data frames?
The longer story: I have read in a CSV file, suppose it looks like this
,"Date","px high","px low","px last",,,,"Date","px high","px low","px last"
"eur curncy",03/Jan/2000,1.03,1.01,1.02,,,"gbp curncy",03/Jan/2000,1.64,1.61,1.64
,1/4/2000,1.03,1.02,1.03,,,,1/4/2000,1.64,1.63,1.64
,1/5/2000,1.04,1.03,1.03,,,,1/5/2000,1.65,1.64,"#N/A N/A"
,1/6/2000,1.04,1.03,1.03,,,,1/7/2000,1.65,1.64,1.65
When I store the read CSV file and print the variable it looks like
Date px.high px.low px.last Date.1 px.high.1 px.low px.last
eur curncy 03/Jan/2000 1.03 1.02 1.03 03/Jan/2000 1.64 1.63 1.64
1/4/2000 1.03 1.02 1.03 1/4/2000 1.64 1.63 1.64
... etc.
I have shaved off a lot of the data for this example to avoid clutter, but there are many more rows and columns of this data. Along the columns they repeat in these groups each with a date, px high, etc. Along the rows you more or less get the same as in the last couple rows shown above.
I ultimately want to go into each group of data, segment it into months, compute the average values for each month in each column, and throw away the daily information, and then make a bar chart for each group. However, I have the following problems that I need to solve:
The first row of dates are in a different format from the other rows. All the rows after the first row are in the same format. I can pretty well fix this myself by reading in the data as
cur <- read.csv('C:\\file.csv', stringsAsFactors=FALSE)
and then looping over the columns, assigning in the right places
cur[1,col] <- as.character(as.Date(cur[1,col], format='%d/%b/%Y'))
I can then format the rest of the date entries by looping over rows and then columns and basically do the same thing again.
Some of the entries in the CSV file contain the string "#N/A N/A" which I've found will force R to read every other entry in that column as a string, so that I can no longer perform arithmetic on the objects. I'm fine with just throwing away those rows of data that have this on them, but even when doing so the columns remain strings. Also, if I throw the row away from one of these groups, I throw away the whole row for all of the rest of the data, which I don't want to do.
The arithmetic problem is easy to solve, when I do the arithmetic I just convert everything to a numeric. It might be inefficient but it seems to have worked well enough. But the issue of all of these rows being together in the same data frame so that if I throw away one row I also throw away all other data on that row--and sometimes the dates of the groups don't match. So if I throw away a row that has an "#N/A N/A" on it for one date, I'll be throwing away other dates for other groups, which I don't want. Hence the best solution I can think of is to split the groups into their own data frames and treat them somewhat separately.
Some of the data have mis-matching dates. I want to basically throw away any date from any one of these groups of data, if that date is not shared by all the data. But again I only want to do this for the same date in all groups--I can't just delete a row because again that row may correspond to one date in one group but another date in another group. So again it seems like splitting the groups is the thing to do.
But if anyone thinks there's a better way to go, let me know.
To answer your question about lists, yes, you can store data frames in lists:
l <- list(dat1, dat2, dat3)  # and so on for further data frames
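A short sketch of the difference between `c()` and `list()` for data frames, using two made-up data frames (`dat1`, `dat2`) and made-up element names (`eur`, `gbp`):

```r
dat1 <- data.frame(a = 1:2, b = 3:4)
dat2 <- data.frame(a = 5:6, b = 7:8)

# c() concatenates the columns of both data frames into one plain list
flat <- c(dat1, dat2)
length(flat)             # 4 separate columns, no data frames left

# list() keeps each data frame intact as one element
nested <- list(eur = dat1, gbp = dat2)
length(nested)           # 2 data frames

# Elements can be pulled out by name or processed with lapply/sapply
rows_per_group <- sapply(nested, nrow)
print(rows_per_group)
```

Naming the elements makes it easy to loop over the groups later, for example with `lapply(nested, summary)`.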
If you have odd NA values, (999, -1, -11, #N/A, etc), you can use na.strings to catch those and keep your columns as numerics:
(dat <- read.csv(header = TRUE, na.strings = c('#N/A N/A'),
stringsAsFactors = FALSE,
text="Date,px high,px low,px last,
03/Jan/2000,1.03,1.01,1.02,
03/Jan/2000,1.64,1.61,1.64,
1/4/2000,1.03,1.02,1.03,
1/4/2000,1.64,1.63,1.64,
1/5/2000,1.04,1.03,1.03,
1/5/2000,1.65,1.64,#N/A N/A,
1/6/2000,1.04,1.03,1.03,
1/7/2000,1.65,1.64,1.65")[1:4])
# Date px.high px.low px.last
# 1 03/Jan/2000 1.03 1.01 1.02
# 2 03/Jan/2000 1.64 1.61 1.64
# 3 1/4/2000 1.03 1.02 1.03
# 4 1/4/2000 1.64 1.63 1.64
# 5 1/5/2000 1.04 1.03 1.03
# 6 1/5/2000 1.65 1.64 NA
# 7 1/6/2000 1.04 1.03 1.03
# 8 1/7/2000 1.65 1.64 1.65
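Like you said, the dates are in mixed formats, so I use this crude function to check which format each value uses and tell R how to parse it. Note that it reads the all-numeric dates as day/month/year; if they are actually month/day/year, use '%m/%d/%Y' instead: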
f_dat <- function(x)
as.Date(x, format = ifelse(is.na(as.numeric(gsub('/','',x))),
'%d/%b/%Y', '%d/%m/%Y'))
## and format the dates:
(dat <- within(dat, {
Date <- f_dat(Date)
}))
# Date px.high px.low px.last
# 1 2000-01-03 1.03 1.01 1.02
# 2 2000-01-03 1.64 1.61 1.64
# 3 2000-04-01 1.03 1.02 1.03
# 4 2000-04-01 1.64 1.63 1.64
# 5 2000-05-01 1.04 1.03 1.03
# 6 2000-05-01 1.65 1.64 NA
# 7 2000-06-01 1.04 1.03 1.03
# 8 2000-07-01 1.65 1.64 1.65
EDIT
dat <- read.csv(header = TRUE, na.strings = c('#N/A N/A'),
stringsAsFactors = FALSE,
text=",Date,px high,px low,px last,,,,Date,px high,px low,px last
eur curncy,03/Jan/2000,1.03,1.01,1.02,,,gbp curncy,03/Jan/2000,1.64,1.61,1.64
,1/4/2000,1.03,1.02,1.03,,,,1/4/2000,1.64,1.63,1.64
,1/5/2000,1.04,1.03,1.03,,,,1/5/2000,1.65,1.64,#N/A N/A
,1/6/2000,1.04,1.03,1.03,,,,1/7/2000,1.65,1.64,1.65")
# X Date px.high px.low px.last X.1 X.2 X.3 Date.1 px.high.1 px.low.1 px.last.1
# 1 eur curncy 03/Jan/2000 1.03 1.01 1.02 NA NA gbp curncy 03/Jan/2000 1.64 1.61 1.64
# 2 1/4/2000 1.03 1.02 1.03 NA NA 1/4/2000 1.64 1.63 1.64
# 3 1/5/2000 1.04 1.03 1.03 NA NA 1/5/2000 1.65 1.64 NA
# 4 1/6/2000 1.04 1.03 1.03 NA NA 1/7/2000 1.65 1.64 1.65
f_dat <- function(x)
as.Date(x, format = ifelse(is.na(as.numeric(gsub('/','',x))),
'%d/%b/%Y', '%d/%m/%Y'))
(dat <- within(dat, {
Date <- f_dat(Date)
Date.1 <- f_dat(Date.1)
}))
# X Date px.high px.low px.last X.1 X.2 X.3 Date.1 px.high.1 px.low.1 px.last.1
# 1 eur curncy 2000-01-03 1.03 1.01 1.02 NA NA gbp curncy 2000-01-03 1.64 1.61 1.64
# 2 2000-04-01 1.03 1.02 1.03 NA NA 2000-04-01 1.64 1.63 1.64
# 3 2000-05-01 1.04 1.03 1.03 NA NA 2000-05-01 1.65 1.64 NA
# 4 2000-06-01 1.04 1.03 1.03 NA NA 2000-07-01 1.65 1.64 1.65
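To connect this back to the original goal, here is a rough sketch of splitting the column groups into a list of data frames and keeping only the dates shared by every group. It assumes the dates are already parsed, uses ISO-formatted January dates for clarity, and keeps only one price column per group; the names `groups` and `common` are illustrative.

```r
# Simplified stand-in for the parsed data: two groups side by side
dat <- data.frame(
  Date      = as.Date(c("2000-01-03", "2000-01-04", "2000-01-05", "2000-01-06")),
  px.high   = c(1.03, 1.03, 1.04, 1.04),
  Date.1    = as.Date(c("2000-01-03", "2000-01-04", "2000-01-05", "2000-01-07")),
  px.high.1 = c(1.64, 1.64, 1.65, 1.65)
)

# One data frame per group, with the same column names in each
groups <- list(
  eur = dat[, c("Date", "px.high")],
  gbp = setNames(dat[, c("Date.1", "px.high.1")], c("Date", "px.high"))
)

# Dates (as text) that occur in every group
common <- Reduce(intersect, lapply(groups, function(g) format(g$Date)))

# Drop rows whose date is missing from any other group
groups <- lapply(groups, function(g) g[format(g$Date) %in% common, ])
print(groups)
```

Because each group is filtered on its own, dropping 2000-01-06 from one group and 2000-01-07 from the other does not remove any data from the remaining rows.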

tm package: Output of findAssocs() in a matrix instead of a list in R

Consider the following list:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
How do I manage to have a data frame with all terms associated with these 3 words in the columns and showing:
The corresponding correlation coefficient (if it exists)
NA if it does not exist for this word (for example the couple (oil, they) would show NA)
Here's a solution using reshape2 to help reshape the data
library(reshape2)
aa<-do.call(rbind, Map(function(d, n)
cbind.data.frame(
xterm=if (length(d)>0) names(d) else NA,
cor=if(length(d)>0) d else NA,
term=n),
a, names(a))
)
dcast(aa, term~xterm, value.var="cor")
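The same reshaping can be sketched in base R without extra packages. The list `a` below is a small hand-made stand-in that assumes `findAssocs()` returns a named list of named numeric vectors; the correlation values are placeholders, not real output from the `crude` corpus.

```r
# Hypothetical stand-in for the findAssocs() result
a <- list(
  oil  = c(prices = 0.72, opec = 0.87),
  opec = c(meeting = 0.88, oil = 0.87),
  xyz  = numeric(0)                      # a word with no associations
)

# Every associated term that occurs for at least one word
terms <- sort(unique(unlist(lapply(a, names))))

# Look each term up in each vector; missing terms come back as NA
m <- t(vapply(a, function(v) unname(v[terms]), numeric(length(terms))))
colnames(m) <- terms

res <- data.frame(term = rownames(m), m, row.names = NULL, check.names = FALSE)
print(res)
```

Unlike the `dcast()` result, a word with no associations (here `xyz`) simply becomes a row of NA values, with no extra NA column.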
Or you could use dplyr and tidyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
a1 <- unnest(lapply(a, function(x) data.frame(xterm=names(x),
cor=x, stringsAsFactors=FALSE)), term)
a1 %>%
spread(xterm, cor) #here it removed terms without any `cor` for the `xterm`
# term 15.8 ability above agreement analysts buyers clearly emergency fixed
#1 oil 0.87 NA 0.76 0.71 0.79 0.70 0.8 0.75 0.73
#2 opec 0.85 0.8 0.82 0.76 0.85 0.83 NA 0.87 NA
# late market meeting prices prices. said that they trying who winter
#1 0.8 0.75 0.77 0.72 NA 0.78 0.73 NA 0.8 0.8 0.8
#2 NA NA 0.88 NA 0.79 0.82 NA 0.8 NA NA NA
Update
aNew <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))
aNew2 <- aNew[!!sapply(aNew, function(x) length(dim(x)))]
aNew3 <- unnest(lapply(aNew2, function(x) data.frame(xterm=rownames(x),
cor=x[,1], stringsAsFactors=FALSE)[1:3,]), term)
res <- aNew3 %>%
spread(xterm, cor)
dim(res)
#[1] 1021 160
res[1:3,1:5]
# term ... 100,000 10.8 1.1
#1 ... NA NA NA NA
#2 100,000 NA NA NA 1
#3 10.8 NA NA NA NA
