Creating a function to split data frames multiple times, then recombine - R

I'm working on a large dataset in R with three factors: FY (6 levels), Region (10 levels), and Service (24 levels). I need to sum my numeric vector, SumOfUnits, at all three levels. The only way I can think to do this is to split the data frame first into 6 data frames by FY, then split each of those into 10 data frames by Region, then split each of those into 24 data frames by Service; at that point I can finally take the sum of the numeric vector and recombine everything into one data frame. The result would have 6*10*24 (1440) rows and 4 columns. My current approach involves a lot of splitting, so I thought I might be able to write a function to use at each level of the split, but I haven't used function() much in R, so I'm not sure what to write (or whether that's even possible). I also imagine there is probably a more efficient way to get the formatted dataset, so I welcome all suggestions.
Here are a few lines from my data frame:
FY Region Service SumOfUnits
1 2006 1 Medication 13
2 2006 1 Medication 1
3 2006 1 Screening & Assessment 38
4 2006 1 Screening & Assessment 13
5 2006 1 Screening & Assessment 41
6 2006 1 Screening & Assessment 67
7 2006 1 Screening & Assessment 222
8 2006 1 Residential Treatment 38
9 2006 1 Residential Treatment 1558
This is the code I've been using for my splits:
# Split MIC into a list of data frames, one per year (FY)
X <- split(MIC, MIC$FY)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
#Assign the dataframes in the list Y to individual objects
A <- Y[[1]]
B <- Y[[2]]
C <- Y[[3]]
D <- Y[[4]]
E <- Y[[5]]
Q <- Y[[6]]
#Creating 10 dataframes from 2006 split by region
X <- split(A, A$Region)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
Reg1 <- Y[[1]]
Reg2 <- Y[[2]]
Reg3 <- Y[[3]]
Reg4 <- Y[[4]]
Reg5 <- Y[[5]]
Reg6 <- Y[[6]]
Reg7 <- Y[[7]]
Reg8 <- Y[[8]]
Reg9 <- Y[[9]]
Reg10 <- Y[[10]]
#Creating 24 dataframes: for 2006, region 1
X <- split(Reg1, Reg1$Service)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
Serv1 <- Y[[1]]
Serv2 <- Y[[2]]
Serv3 <- Y[[3]]
Serv4 <- Y[[4]]
Serv5 <- Y[[5]]
#etc...
I would want a sample of my data to look something like this:
FY Region Service SumOfUnits
2006 1 Medication 4300
2006 2 Medication 3299
2006 3 Medication 2198
2007 1 Medication 5467
2007 2 Medication 3214
2007 3 Medication 9807

There is quite a nice function in the plyr package to do this:
library(plyr)
ddply(MIC, .(FY, Region, Service), summarize, sumOfUnits=sum(SumOfUnits))
It gives back exactly what you need.
For MIC =
FY Region Service SumOfUnits
1 2006 1 A 1
2 2006 2 B 4
3 2007 1 C 3
4 2007 2 D 2
5 2007 2 E 7
6 2006 1 A 3
7 2007 1 D 3
8 2007 2 B 4
9 2007 2 B 6
returns:
FY Region Service sumOfUnits
1 2006 1 A 4
2 2006 2 B 4
3 2007 1 C 3
4 2007 1 D 3
5 2007 2 B 10
6 2007 2 D 2
7 2007 2 E 7
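If you prefer to stay in base R, the same grouped sum can be written with aggregate() (a sketch, assuming your data frame is called MIC as in the question):
# base R equivalent of the ddply call above
aggregate(SumOfUnits ~ FY + Region + Service, data = MIC, FUN = sum)
Note that, like ddply(), aggregate() only returns the FY/Region/Service combinations that actually occur in the data, so you may end up with fewer than the full 1440 rows.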


Comparing items in a list to a dataset in R

I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end up with a dataset that looks something like this:
Obs Var List
1 1
2 2 1
3 3 1
4 4 1
5 5
6 6
7 7 2
8 8 2
9 9
10 10
How do I go about doing this? Thank you!
Here is one way to do it using %in% and a boolean trick: multiply the logical match vector by the list number. If an observation matches several lists, the last one wins here:
dat <- data.frame(Obs = 1:10)
list_all <- list(2:4, c(7, 8))
# one column per list: n where Obs is in list n, 0 otherwise
present <- sapply(seq_along(list_all), function(n) (dat$Obs %in% list_all[[n]]) * n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
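An alternative sketch, assuming the same dat and list_all as above, is to flatten the lists into a small lookup table and use match():
# one row per list element, tagged with the list it came from
lookup <- data.frame(value = unlist(list_all),
                     List = rep(seq_along(list_all), lengths(list_all)))
dat$List <- lookup$List[match(dat$Obs, lookup$value)]
Unmatched observations get NA automatically. One difference from the max() approach: if a value appears in several lists, match() keeps the first list rather than the last.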

Data Masking in Dataframe [duplicate]

I have a data frame with 8 unique values:
data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
>data
id
1 ab
2 cc
3 cc
4 dd
5 ee
6 ff
7 ee
8 ff
9 ab
10 dd
11 gg
12 1
13 air
I create another data frame holding 8 unique values to be used as replacements:
library(random)
replacements<-data.frame(value=randomStrings(n=8, len=2, digits=FALSE,loweralpha=TRUE, unique=TRUE, check=TRUE))
replacements
V1
1 SJ
2 fH
3 TZ
4 Mr
5 oZ
6 kZ
7 fe
8 ql
I want to replace all unique values in the data data frame with the values in the replacements data frame, as follows:
All ab values replaced by SJ
All cc values replaced by fH
All dd values replaced by TZ
All ee values replaced by Mr
All ff values replaced by oZ
All gg values replaced by kZ
All 1 values replaced by fe
All air values replaced by ql
Currently, I am achieving this by:
data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
data$id<-as.character(data$id)
replacements<-data.frame(value=randomStrings(n=8, len=2, digits=FALSE,loweralpha=TRUE, unique=TRUE, check=TRUE))
replacements$V1<-as.character(replacements$V1)
for(i in 1:length(unique(data$id))){
data$id[data$id %in% data$id[i]] <- replacements$V1[i]
}
>data
id
1 SJ
2 fH
3 fH
4 TZ
5 Mr
6 oZ
7 Mr
8 oZ
9 SJ
10 TZ
11 kZ
12 fe
13 ql
Is there any base R function to achieve this? Is there a better approach for masking data?
I would suggest using merge(), but to do that you would first need to add a column of unique data$id to replacements, as both data.frames need to have a column in common.
Here's data:
> data
id
1 ab
2 cc
3 cc
4 dd
5 ee
6 ff
7 ee
8 ff
9 ab
10 dd
11 gg
12 1
13 air
Here's replacements:
> replacements
V1
1 VS
2 Of
3 bH
4 iJ
5 jm
6 kH
7 cm
8 rQ
So add unique data$id to replacements:
replacements$id <- unique(data$id)
Giving:
V1 id
1 VS ab
2 Of cc
3 bH dd
4 iJ ee
5 jm ff
6 kH gg
7 cm 1
8 rQ air
Then merge data with replacements using id:
data <- merge(data, replacements, by = "id", all.x = TRUE, sort = FALSE)
Giving:
id V1
1 ab VS
2 ab VS
3 cc Of
4 cc Of
5 dd bH
6 dd bH
7 ee iJ
8 ee iJ
9 ff jm
10 ff jm
11 gg kH
12 1 cm
13 air rQ
If you really wanted to keep only the new id column, you could drop the original id and rename the new column:
data <- data[, 2, drop = FALSE]
colnames(data) <- "id"
Giving:
id
1 VS
2 VS
3 Of
4 Of
5 bH
6 bH
7 iJ
8 iJ
9 jm
10 jm
11 kH
12 cm
13 rQ
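A more compact base R alternative is a named-vector lookup (a sketch, starting again from the original data and replacements, and relying on the same assumption as above that the i-th unique id corresponds to the i-th replacement):
# named vector mapping each unique id to its replacement
key <- setNames(as.character(replacements$V1), unique(as.character(data$id)))
data$id <- unname(key[as.character(data$id)])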
Masking data using the CRC32 algorithm:
library(data.table)
library(digest)
data <- data.frame(id = c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
setDT(data)
anonymize <- function(x, algo = "crc32") {
  # hash each unique value once, then look the hashes up for every row
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo = algo),
                       FUN.VALUE = "", USE.NAMES = TRUE)
  unname(unq_hashes[x])
}
cols_to_mask <- c("id")
data[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
Reference: Data anonymization in R

Convert individual response data frames into team-level data frames in R

I have the following data-frames:
> team_1_A
MemberA Q1 Q2
1 C 2 3
2 B 3 4
> team_1_B
MemberB Q1 Q2
1 A 5 4
2 C 5 2
> team_1_C
MemberC Q1 Q2
1 A 2 5
2 B 5 5
These data frames correspond to individual responses regarding their perceptions of team members. For example, the first data frame shows how team member A (in team 1) perceives team members B and C on questions 1 and 2 (Q1 and Q2). My goal is to automate a process that converts these individual-response data frames into team-level data frames for each question, so that the result for these data frames would be:
> T1Q1
X A B C
1 A 0 3 2
2 B 5 0 5
3 C 2 5 0
> T1Q2
X A B C
1 A 0 3 2
2 B 5 0 5
3 C 2 5 0
Thus, 0s appear along the diagonal, and each row contains what that member said about the others for a given question. For example, row 1 of T1Q1 shows A's perceptions of B and C.
The reshape2 package will make your life easy for this task:
rm(list=ls())
library(reshape2)
team_1_A <- data.frame(MemberA=c('C', 'B'), Q1=c(2,3), Q2=c(3,4))
team_1_B <- data.frame(MemberB=c('A', 'C'), Q1=c(5,5), Q2=c(4,2))
team_1_C <- data.frame(MemberC=c('A', 'B'), Q1=c(2,5), Q2=c(5,5))
# store data frames in a list
members <- list(team_1_A, team_1_B, team_1_C)
# format rows/columns
formatted <- lapply(members, function(m) {
  m$Respondent <- gsub('Member', '', names(m)[1])
  names(m)[1] <- 'TeamMate'
  return(m)
})
formatted <- do.call('rbind', formatted)
# separate questions into different data frames
questionList <- lapply(c(1,2), function(q) formatted[, c('Respondent', 'TeamMate', paste0('Q', q))])
# reshape, ensure order is correct
questionList <- lapply(questionList, function(q) {
  q <- dcast(q, Respondent ~ TeamMate)
  q <- q[, c('Respondent', 'A', 'B', 'C')]
  return(q)
})
names(questionList) <- c('T1Q1', 'T1Q2')
# replace NA with 0 (a sketch of this step is shown after the output below)
# etc...
questionList
$T1Q1
Respondent A B C
1 A NA 3 2
2 B 5 NA 5
3 C 2 5 NA
$T1Q2
Respondent A B C
1 A NA 4 3
2 B 4 NA 2
3 C 5 5 NA
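The "replace NA with 0" step that the answer leaves as an exercise might look like this (a sketch operating on questionList as built above):
questionList <- lapply(questionList, function(q) {
  # overwrite every NA cell (the diagonal) with 0
  q[is.na(q)] <- 0
  q
})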

Split and Diff function in R

I have a data frame called data. I am splitting it using the split function on an attribute called KEY.
data <- split(data, data$KEY);
After splitting the data frame by KEY, what we get is data for individual firms; the data frame data had the data for all the firms in the universe. After the split, each individual piece has two columns, year and sales. For each piece, I have to calculate the incremental sales for each year. For instance, if we have 2002 - 10, 2003 - 12, 2004 - 15, 2005 - 20, what I am interested in getting would be 2003 - 2, 2004 - 3, 2005 - 5, for each split.
I have written a function, called mod_sale, to perform the job mentioned:
mod_sale <- function(data) {
  data <- data[with(data, order(year)), ]
  sale_data <- diff(data$SALE)
  data <- data[-1, ]
  data$SALE <- sale_data
  return(data)
}
Currently, I am using a for loop:
for (key in names(data)) {
  a <- try(mod_sale(data[[key]]))
  if (class(a) == "try-error") next
  mod_data <- rbind(mod_data, a)
}
I think there is some way I can use sapply (and maybe plyr too). Can someone help me improve this R code? I'm not sure how the sapply code would go.
sapply(data, mod_sale)
Any help would be appreciated. Thanks.
Edit:
Here is a data example:
a <- data.frame();
key <- c(1,1,1,1,2,2,2,2,2,3,3,3);
sales <- c(12,12,15,8,3,6,3,9,9,12,3,7);
year <- c(2002,2003,2004,2005,2001,2002,2003,2004,2005,2003,2004,2005);
ovar <- runif(12,5.0,7.5);
a <- data.frame(key,sales,year,ovar)
In the resulting data.frame, I am expecting incremental sales rather than raw sales. Obviously, we will lose 3 data points for the 3 keys, one for each starting year, since we are taking differences. So there will be three fewer rows in the resulting data.frame, which would have columns key, diff(sales), year, and ovar.
This is what I would have done:
a$diffsales <- ave( a$sales, a$key, FUN=function(x) c(NA, diff(x) ) )
a
key sales year ovar diffsales
1 1 12 2002 6.845177 NA
2 1 12 2003 6.328153 0
3 1 15 2004 6.872669 3
4 1 8 2005 6.098920 -7
5 2 3 2001 7.154824 NA
6 2 6 2002 6.110810 3
7 2 3 2003 5.906624 -3
8 2 9 2004 5.214369 6
9 2 9 2005 5.818218 0
10 3 12 2003 5.354354 NA
11 3 3 2004 6.728992 -9
12 3 7 2005 7.412213 4
I appreciate the attempt to display what you'd tried. Thank you.
In the future, try to provide a small example, like this:
df <- data.frame(year = 2001:2010,
sale = sample(20,10))
df <- rbind(df,df,df)
df$key <- rep(letters[1:3],each = 10)
That makes it much clearer what your data look like, and it makes it very easy for people trying to answer. The easier you make it for us, the faster and better answers you'll get.
I'd recommend sorting before splitting:
#Sort first (already sorted, but you get the idea)
df <- df[order(df$key,df$year),]
df_split <- split(df,df$key)
You don't actually want to use sapply. (Try it and see.) You just want lapply:
out <- lapply(df_split,function(x) {x$sale_diff <- c(NA,diff(x$sale)); x[-1,]})
You'd put it all together again using:
do.call(rbind,out)
You're right, plyr or data.table could also do this. I'll leave those examples to others.
Using data.table:
library(data.table)
dt = data.table(a)
dt[, sale_diff := c(NA, diff(sales)), by = key]
dt
# key sales year ovar sale_diff
# 1: 1 12 2002 7.416857 NA
# 2: 1 12 2003 5.625818 0
# 3: 1 15 2004 5.018934 3
# 4: 1 8 2005 6.671986 -7
# 5: 2 3 2001 6.242739 NA
# 6: 2 6 2002 6.297763 3
# 7: 2 3 2003 6.482124 -3
# 8: 2 9 2004 6.724256 6
# 9: 2 9 2005 5.071265 0
#10: 3 12 2003 6.136681 NA
#11: 3 3 2004 6.974392 -9
#12: 3 7 2005 6.517553 4
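For completeness, the plyr approach mentioned above might look like this (a sketch using the same a, and assuming rows are already ordered by year within each key; it keeps the NA rows rather than dropping them):
library(plyr)
ddply(a, "key", transform, sale_diff = c(NA, diff(sales)))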

How to perform a Wilcoxon test in R?

I have this data frame with 4 genes and 3 samples measured in duplicate.
The TS is the standard.
I want to perform the Wilcoxon test between sample S1 and TS, and between S2 and TS, for each protein, but I'm having problems with the for loop.
MS.rawMV <- read.table("C:/Users/aaa/Desktop/genomic/MS.csv", header=T)
S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
gene 1 1 1 2 3 5 5
gene 2 10 10 4 5 9 10
gene 3 5 6 4 4 5 7
gene 4 9 9 8 7 6 6
Samples <- list(
  S1 = grep("S1_*", colnames(MS.rawMV), value = TRUE),
  S2 = grep("S2_*", colnames(MS.rawMV), value = TRUE),
  TS = grep("TS_*", colnames(MS.rawMV), value = TRUE))
sample.names <- names(Samples)
ref.sample <- "TS_"
# Build a data.frame
GRates <- data.frame(MS.rawMV[Reduce("c", Samples)])
## Statistics: non parametric test using TS as a standart
for (i in names(Samples)) {
  WILCOXTEST <- wilcox.test(GRates[c(Samples[[i]])], Samples[[ref.sample]])
  pnames <- paste(i, ".wilcoxtest", sep = "")
  GRates[pnames] <- WILCOXTEST["p.value"]
}
Error in wilcox.test.default(GRates[Samples[[i]]], Samples[[ref.sample[i]]]) :
'x' must be numeric
It looks like the data is being treated as a factor.
The easiest fix would be to convert them back to numeric via factor->character->numeric.
Try this:
wilcox.test(
as.numeric(as.character(GRates[c(Samples[[i]])])),
as.numeric(as.character(Samples[[ref.sample]]))
)
If you try to convert straight from factor to numeric, you'll end up with the integer level codes instead of the actual values.
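A quick illustration of that pitfall (hypothetical values, just to show the behaviour):
f <- factor(c("10", "2"))
as.numeric(f)                # 1 2  -- the level codes, not the values
as.numeric(as.character(f))  # 10 2 -- the actual values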
#DWin's comment is well taken (you have additional structure in your data that is hard to incorporate into a Wilcoxon test). However, if you want to ignore the distinction between the _1 and _2 columns and run Wilcoxon test on S1 vs TS and S2 vs TS, here's a way to rearrange the data and do it:
dat <- read.table(text="
gene S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
1 1 1 2 3 5 5
2 10 10 4 5 9 10
3 5 6 4 4 5 7
4 9 9 8 7 6 6",
header=TRUE)
library(reshape2)
library(plyr)
m1 <- melt(dat,id.var="gene")
## break var_num into separate components
m2 <- subset(data.frame(m1,
colsplit(m1$variable,"_",names=c("var","num"))),
select=-variable)
## combine treatments with standards
m3 <- merge(subset(m2,var!="TS"),
subset(m2,var=="TS"),by=c("gene","num"))
## clean up
m4 <- subset(rename(m3,c(value.x="value",var.x="var",value.y="standard")),
select=-var.y)
## apply Wilcoxon test to each component, save the p value
ddply(m4,"var",
function(x) with(x,wilcox.test(value,standard))$p.value)
Or, if you want to test each replication separately (as in #agstudy's answer), do
ddply(m4,c("var","num"),
function(x) with(x,wilcox.test(value,standard))$p.value)
instead.
Since wilcox.test is not vectorized, I think you need two loops. Even though I'm not sure of the statistical meaning of this, here is how you can do it:
nn <- colnames(dat)
lapply(1:2, function(x) {
  col.L <- grep(paste0('S', x, '_*'), nn)
  col.R <- dat[, paste0('TS_', x)]
  lapply(col.L, function(y)
    wilcox.test(dat[, y], col.R)['p.value'])
})
Here I assume dat is:
dat <- read.table(text='S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
gene_1 1 1 2 3 5 5
gene_2 10 10 4 5 9 10
gene_3 5 6 4 4 5 7
gene_4 9 9 8 7 6 6',header=TRUE)
