I am trying to create a simple function that sums some variables in a nested data set.
Here is a much simpler example:
df <- data.frame(ID=c(1,1,1,1,2,3,3,4,4,4,5,6,7,7,7,7,7,7,7,7),
var=c("A","B","C","D","B","A","D","A","C","D","D","D","A","D","A","A","A","B","B","B"),
N=c(50,50,50,50,298,156,156,85,85,85,278,301,98,98,98,98,98,98,98,98))
Think of this as a data frame containing the results of 7 different studies. Each study has investigated one or more variables (A, B, C, D). The columns mean:
ID = the ID of the respective study.
var = the respective variable measured in each study. Some studies have measured only one variable (e.g., ID=2, which only measured B), some several.
N = the sample size of each study. That is, each ID has one sample size.
I would like to create a function that summarizes three things:
k = how many studies measured each variable (e.g., "A")
m = how often each variable was measured, regardless of whether some studies measured a variable more than once (a simple frequency)
N = the sample size per variable, counted only once per study. That is, no duplicates per study ID are allowed.
My current version (I am a real noob, so please forgive the form) produces exactly what I want:
  model km    N
1 A     4 (7) 389
2 B     3 (5) 446
3 C     2 (2) 135
4 D     6 (6) 968
For instance, variable A was measured 7 times, but only by 4 studies (study #7 measured it several times). The (non-redundant) sample size was N=389, counting study #7's repeated measurements only once.
(Note: The parentheses in the table are helpful as I intend to copy the results into a document)
Here is the current version of the code. The problems begin with the part containing the pipes:
library(dplyr)
kmn <- function(data, x, ID, N) {
  m <- table(data[[x]])
  k <- apply(table(data[[x]], data[[ID]]), 1, function(x) length(x[x > 0]))
  model <- levels(data[[x]])
  km <- cbind(k, m)
  colnames(km) <- c("k", "m")
  km <- paste0(k, " (", m, ")")
  smpsize <- data %>%
    group_by(data[[x]]) %>%
    summarise(N = sum(N[!duplicated(ID)])) %>%
    select(N)
  cbind(model, km, smpsize)
}
kmn(data = df, x = "var", ID = "ID", N = "N")
The above code works, but only if the df data frame really contains a variable named N (not with a different variable name). I guess the "data %>%" makes R look for N inside the data frame instead of resolving it from the function argument.
I can imagine this looks horrible to someone who knows what they are doing :)
Thank you for any ideas
Holger
First, remove duplicates using the unique function and sum N by var.
Second, take df and group by var; n() gives the count and n_distinct(ID) the number of unique IDs; then join with the stats_N data frame:
library(dplyr)
stats_N <- df %>%
  select(ID, var, N) %>%
  unique() %>%
  group_by(var) %>%
  summarise(N = sum(N))
df %>%
  group_by(var) %>%
  summarise(n = n(), km = n_distinct(ID)) %>%
  left_join(stats_N)
# A tibble: 4 x 4
# var n km N
# <fct> <int> <int> <dbl>
#1 A 7 4 389
#2 B 5 3 446
#3 C 2 2 135
#4 D 6 6 968
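A small usage note: without a by argument, left_join() guesses the key and prints Joining, by = "var"; writing left_join(stats_N, by = "var") makes the join explicit and silent.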
In addition to @fmarm's answer, it can also be done without a join: group by 'var', get the number of distinct elements in 'ID' (n_distinct), the number of rows (n()), and the sum of the non-duplicated 'N's:
library(dplyr)
df %>%
  group_by(model = var) %>%
  summarise(km = sprintf("%d (%d)", n_distinct(ID), n()),
            N = sum(N[!duplicated(N)]))
# A tibble: 4 x 3
# model km N
# <fct> <chr> <dbl>
#1 A 4 (7) 389
#2 B 3 (5) 446
#3 C 2 (2) 135
#4 D 6 (6) 968
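To also keep the asker's function interface (column names passed as strings), the .data pronoun available in recent dplyr versions (0.7+) sidesteps the scoping problem described in the question; a minimal sketch:
library(dplyr)
kmn <- function(data, x, ID, N) {
  # .data[[...]] resolves the string arguments as columns of `data`
  data %>%
    group_by(model = .data[[x]]) %>%
    summarise(km = sprintf("%d (%d)", n_distinct(.data[[ID]]), n()),
              N = sum(.data[[N]][!duplicated(.data[[ID]])]))
}
kmn(data = df, x = "var", ID = "ID", N = "N")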
I have a four-column dataframe with date, var1_share, var2_share, and total. I want to multiply each of the share metrics by the total to create new variables containing the raw values for both var1 and var2. See the code below (a bit verbose) that constructs the dataframe containing the share variables:
df <- data.frame(dt = seq.Date(from = as.Date('2019-01-01'),
                               to = as.Date('2019-01-10'), by = 'day'),
                 var1 = round(runif(10, 3, 12), digits = 1),
                 var2 = round(runif(10, 3, 12), digits = 1))
df$total <- apply(df[2:3], 1, sum)
ratio <- lapply(df[-1], function(x) x/df$total)
ratio <- data.frame(ratio)
df <- cbind.data.frame(df[1], ratio)
colnames(df) <- c('date', 'var1_share', 'var2_share', 'total')
df
The final dataframe should look like this:
> df
date var1_share var2_share total
1 2019-01-01 0.5862069 0.4137931 1
2 2019-01-02 0.6461538 0.3538462 1
3 2019-01-03 0.3591549 0.6408451 1
4 2019-01-04 0.7581699 0.2418301 1
5 2019-01-05 0.3989071 0.6010929 1
6 2019-01-06 0.5132743 0.4867257 1
7 2019-01-07 0.5230769 0.4769231 1
8 2019-01-08 0.4969325 0.5030675 1
9 2019-01-09 0.5034965 0.4965035 1
10 2019-01-10 0.3254438 0.6745562 1
I have nested an if statement within a for loop, hoping to return a new dataframe called share. I want it to skip date when using the share variables, so I've incorporated is.numeric so that it ignores that column. However, when I run it, it only returns the date, not the desired result of date, the raw values of each variable (as separate columns), and the total column. See the code below:
for (i in df) {
  share <- if (is.numeric(i)) {
    i * df$total
  } else i
  share <- data.frame(share)
  return(share)
}
share
> share
share
1 2019-01-01
2 2019-01-02
3 2019-01-03
...
How do I adjust this function so that share returns a dataframe containing date, variable 1 and 2 raw variables, and total?
One could note that multiplying a data.frame by a vector (*) casts the multiplication column-wise over the data frame (the vector is applied to column 1, 2, 3, etc.). As such, you can do this without any 'apply' by simply using * between the total column and the columns you want to multiply.
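For instance, a minimal sketch of that direct approach on the df above (the *_raw column names are just illustrative):
# multiply both share columns by total in one vectorized step
df[c("var1_raw", "var2_raw")] <- df[c("var1_share", "var2_share")] * df$total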
Or you could make a simple function to achieve the result. Below is such an example.
Multi_share <- function(x, total_col = "total"){
  # if total_col is a column name, multiply all other numeric columns by it
  if (is.character(total_col))
    return(x[, sapply(x, is.numeric) & names(x) != total_col] * x[, total_col])
  # if total_col is a numeric vector of matching length, use it directly
  if (is.numeric(total_col) && NROW(total_col) == NROW(x))
    return(x[, sapply(x, is.numeric)] * total_col)
  stop("Total unrecognized. Must either be a numeric vector of matching length or a character naming the total column in x.")
}
cbind(df, Multi_share(df))
One could change the names of the columns as well.
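For instance, a minimal sketch of the renaming (the *_raw names and the 5:6 positions assume the sample df from the question):
res <- cbind(df, Multi_share(df))
names(res)[5:6] <- c("var1_raw", "var2_raw")  # rename the two appended columns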
Maybe you want something like this?
share <- df[, sapply(df, is.numeric)]
share <- mapply(function(x) x*share$total, share[, names(share) != "total"])
The first line gives you back only the numeric columns (so date is filtered out).
The second multiplies each column (except total) by total.
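Note that mapply() returns a matrix here; to get back a data frame with the skipped columns, one could reattach them afterwards (a minimal sketch, object names as in the answer above):
share <- data.frame(date = df$date, share, total = df$total)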
I'm using R to scale the original data, remove all outliers with a Z-score of 3 or more, and then filter the unscaled data so that it contains only non-outliers. I want to be left with a data frame that contains non-scaled numbers after removing outliers. These were my steps:
Steps
1. Create two data frames (x, y) of the same data
2. Scale x and leave y unscaled.
3. Filter out all rows that have greater than 3 Z-Score in x
4. Currently, for example, x may have 95,000 rows while y still has 100,000
5. Truncate y based on a unique column called Row ID, which I made sure was unscaled in x. This unique column will help me match up the remaining rows in x and the rows in y.
6. y should now have the same number of rows as x, but with the data unscaled. x has the scaled data.
At the moment I can't get the data to be unscaled. I tried using the unscale method and data frame comparison tools, but R complains that I cannot work on data frames of two different sizes. Is there a workaround?
Tries
I've tried dataFrame <- dataFrame[dataFrame$Row %in% remainingRows] but that left nothing in my data frame.
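(One possible culprit in that attempt: data frame row subsetting needs a trailing comma, i.e. dataFrame[dataFrame$Row %in% remainingRows, ]; without the comma, R treats the expression as column selection.)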
I would also provide data, but it has sensitive information, so any data frame will do so long as it has a unique row ID that won't change during scaling.
If I understood correctly what you want to do, I'm suggesting a different approach. You could use two data.frames for that, but if you use the dplyr package, you can do everything within a single pipeline ... and presumably faster as well.
First I'm generating a data.frame with 100k rows, which has an ID column (just 1:100000 sequence) and a value (random numbers).
Here's the code:
library(dplyr)
#generate data
x <- data.frame(ID=1:100000,value=runif(100000,max=100)*runif(10000,max=100))
#take a look
> head(x)
ID value
1 1 853.67941
2 2 632.17472
3 3 3089.60716
4 4 8448.89408
5 5 5307.75684
6 6 19.07485
To filter out the outliers, I'm using a dplyr pipe, chaining multiple operations together with the pipe (%>%) operator: first calculate the zscore, then keep only the observations with a zscore below three, and finally drop the zscore column again to get back to your original format (of course, you can keep it as well):
xclean <- x %>%
  mutate(zscore = (value - mean(value)) / sd(value)) %>%
  filter(zscore < 3) %>%
  select(-matches('zscore'))
If you look at the rows, you'll see that the filtering worked
> cat('Rows of X:',nrow(x),'- Rows of xclean:',nrow(xclean))
Rows of X: 100000 - Rows of xclean: 99575
while the data looks like the original data.frame:
> head(xclean)
ID value
1 1 853.67941
2 2 632.17472
3 3 3089.60716
4 4 8448.89408
5 5 5307.75684
6 6 19.07485
Finally, you can see that observations have been filtered out by comparing the IDs of the two data.frames:
> head(x$ID[!is.element(x$ID,xclean$ID)],50)
[1] 68 90 327 467 750 957 1090 1584 1978 2106 2306 3415 3511 3801 3855 4051
[17] 4148 4244 4266 4511 4875 5262 5633 5944 5975 6116 6263 6631 6734 6773 7320 7577
[33] 7619 7731 7735 7889 8073 8141 8207 8966 9200 9369 9994 10123 10538 11046 11090 11183
[49] 11348 11371
EDIT:
Of course, the two-data-frame version is also possible:
y <- x
# calculate zscore
x$value <- (x$value - mean(x$value))/sd(x$value)
#subset y
y <- y[x$value<3,]
# initially 100k rows
> nrow(y)
[1] 99623
Edit2:
Accounting for multiple value columns:
#generate data
set.seed(21)
x <- data.frame(ID = 1:100000,
                value1 = runif(100000, max = 100) * runif(10000, max = 100),
                value2 = runif(100000, max = 100) * runif(10000, max = 100),
                value3 = runif(100000, max = 100) * runif(10000, max = 100))
> head(x)
ID value1 value2 value3
1 1 2103.9228 5861.33650 713.885222
2 2 341.8342 3940.68674 578.072141
3 3 5346.2175 458.07089 1.577347
4 4 400.1950 5881.05129 3090.618355
5 5 7346.3321 4890.56501 8989.248186
6 6 5305.5105 38.93093 517.509465
The dplyr solution:
# make sure you got a recent version of dplyr
> packageVersion('dplyr')
[1] ‘0.7.2’
# define zscore function:
zscore <- function(x){(x-mean(x))/sd(x)}
# select variables (could also be manually with c())
vars_to_process <- grep('value',colnames(x),value=T)
# calculate zscores and filter
xclean <- x %>% mutate_at(.vars=vars_to_process, .funs=funs(ZS = zscore(.))) %>%
filter_at(vars(matches('ZS')),all_vars(.<3)) %>%
select(-matches('ZS'))
> nrow(xclean)
[1] 98832
Now the solution without dplyr (instead of using two data frames, I'll generate a boolean index based on x):
# select variables
vars_to_process <- grep('value',colnames(x),value=T)
# create index ZS < 3
ix <- apply(x[vars_to_process],2,function(x) (x-mean(x))/sd(x) < 3)
#filter rows
xclean <- x[rowSums(ix) == length(vars_to_process),]
> nrow(xclean)
[1] 98832
I have a dataframe. For simplicity, I am leaving out many columns and rows:
Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
I need to find 6 consecutive rows in the dataframe such that the average distance is 1000, and such that the only types considered are A or B. Just for clarification: one might think to filter out all Type C rows and then proceed, but rows that were not originally consecutive would become consecutive upon filtering, and that's no good.
For example, if I filtered out rows 3 and 5 above, I would be left with 3 rows. And if I had provided more rows, that might produce a faulty result.
Maybe a solution with the data.table library?
For reproducibility, here is a data sample based on what you wrote.
library(data.table)
# data orig (with row numbers...)
DO<-"Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
6 1234 A"
# data: fields separated by semicolons
DS<-gsub('[[:blank:]]+',';',DO)
# data.frame
DF<-read.table(textConnection(DS),header=T,sep=';',stringsAsFactors = F)
#data.table
DT<-as.data.table(DF)
Then, make a function that increments a counter each time a new run of identical values starts:
# function to set sequential group number
mkGroupRep <- function(x){
  cnt <- 1L
  grp <- 1L
  lx <- length(x)
  ne <- x[-lx] != x[-1L]  # next element not equal to current
  for (i in seq_along(ne)) {
    if (ne[i]) cnt <- cnt + 1
    grp[i + 1] <- cnt
  }
  grp
}
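As an aside, newer data.table versions ship rleid(), which computes the same sequential group id in a single call; a minimal alternative sketch:
# same grouping via data.table's built-in run-length id
DT[, grp := rleid(Type)]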
And use it with data.table's 'multiple assignment by reference':
# update dat : set group number based on sequential type
DT[,grp:=mkGroupRep(Type)]
# calc sum of distance and number of item in group, by group
DT[,`:=`(
distMean=mean(Distance),
grpLength=.N
),by=grp]
# filter what you want:
DT[(Type != 'C' & distMean > 100 & grpLength == 2) | grpLength == 3]
Output:
Distance Type grp distMean grpLength
1: 162 A 1 13672 2
2: 27182 A 1 13672 2
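The grouping above summarises runs of identical Type. For the literal window from the question (6 consecutive rows, all of type A or B), here is a minimal base R sketch, assuming "the average distance is 1000" means a mean of at least 1000:
win <- 6
ok <- sapply(seq_len(nrow(DF) - win + 1), function(i) {
  w <- DF[i:(i + win - 1), ]  # the candidate window of rows
  all(w$Type %in% c("A", "B")) && mean(w$Distance) >= 1000
})
which(ok)  # starting row indices of qualifying windows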
I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe; the real one has 180000 rows and over 400 columns.
What I need to do is create a new dataframe, based on each column, that tells me the size of each continuous run of "1" or "-1" and returns the location, size, and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I still haven't figured out how to do this. I have been trying several pieces of code to develop a function that does this automatically, but I get lost or stuck in loops and nothing seems to work.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE columns have the names "LOCAT", "START", and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
"LOCAT"=1:6,
"START"=c(120,345,765, 1045, 1347, 1879),
"END"=c(150,390,803,1120,1436, 1935),
"CODE1"=c(1,1,0,1,0,0),
"CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
  rleX <- rle(x[, "CodeValue"]) # rle() to get the size of consecutive values
  stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
  starts <- c(1, head(stops, -1) + 1) # the starts are the first value in your LOCAT column
  ssX0 <- data.frame("Value" = rleX$values, "Starts" = starts, "Stops" = stops) # the starts and stops from X (ss from X)
  ssX <- ssX0[ssX0[, "Value"] != 0, ] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)
  # The next 3 lines calculate the equivalent of your SIZE column
  sizeX1 <- x[ssX[, "Starts"], "START"]
  sizeX2 <- x[ssX[, "Stops"], "END"]
  sizeX <- sizeX2 - sizeX1
  sssX <- data.frame(ssX, "Size" = sizeX) # Combine the Size with the ssX (start stop of X) data frame
  return(sssX) # Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply to get summary stats for each group.
library(plyr)
summarised_by_CODE1 <- ddply(
df1_with_nonzero_CODE1,
.(GROUP),
summarise,
MinOfLOCAT = min(LOCAT),
MaxOfLOCAT = max(LOCAT),
SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 3 4 4 75 POS
Now repeat with CODE2.
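A minimal sketch of wrapping those steps into a reusable function (the name summarise_code is my own), so that repeating the analysis for CODE2 becomes a single call:
library(plyr)
summarise_code <- function(df, code) {
  r <- rle(df[[code]])                     # runs of identical values
  nonzero <- r$values != 0                 # keep only the 1 / -1 runs
  n <- r$lengths[nonzero]
  d <- df[df[[code]] != 0, ]               # drop rows where the code is 0
  d$GROUP <- rep(seq_along(n), times = n)  # id for each contiguous block
  out <- ddply(d, .(GROUP), summarise,
               MinOfLOCAT = min(LOCAT),
               MaxOfLOCAT = max(LOCAT),
               SIZE = max(END) - min(START))
  out$VALUE <- ifelse(r$values[nonzero] == 1, "POS", "NEG")
  out
}
summarise_code(df1, "CODE2")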