Aggregate over several variables in R

I have a rather large dataset in long format where I need to count the number of instances of each ID due to two different variables, A and B. For example, the same person can appear in multiple rows due to either A or B. Counting the number of instances of each ID is not too hard, but I also need to count the number of instances of each ID due to A and due to B, and return these counts as variables in the dataset.
Regards,
//Mi

The ddply() function from the plyr package lets you break data apart by identifier variables, apply a function to each chunk, and then assemble the results back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occurs (using nrow()), and then put those counts back together.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))

OK, given the interpretations I see, then the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$group, FUN = length)
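As a minimal sketch of the ave() idea applied to both counts, assuming the ID and GRP columns from the example data used in the other answers (the output column names ID.FREQ and GRP.FREQ are only illustrative):

```r
# Example data in long format (same shape as the df used in the other answers)
df <- data.frame(ID = rep(c(1, 2), 4),
                 GRP = rep(c("a", "a", "b", "b"), 2))

# Number of rows per ID
df$ID.FREQ <- ave(seq_along(df$ID), df$ID, FUN = length)

# Number of rows per ID and GRP combination
df$GRP.FREQ <- ave(seq_along(df$ID), df$ID, df$GRP, FUN = length)

df
```

Because ave() returns a vector aligned with the original rows, no merge step is needed and the row order is left unchanged.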

Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2

Related

R- How to put multiple cases (rows) in one row

I am doing some coding in R and have encountered an issue.
I have a data set where participants were given the same question(s) a few different times. There is one id variable, a time variable which records which instance we are dealing with, and one outcome variable.
I did a little research and found a post similar to what I am trying to do.
Turning one row into multiple rows in r
I am trying to do the exact opposite thing of what is being done in this post.
I created this small code to give an idea of what I am dealing with.
A1
id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4
What I need to do is reorganize the data set so that each case is on one line and I repeat every X variable multiple times (x1 would be the first time point, x2 would be the second time point, etc). Here is a sample code of what I would like the final data frame to look like.
B1
id x1 x2 x3 x4
1000 1 2 3 4
1001 1 2 3 4
There are a few nuances in my data that make this situation really tricky. Some participants have many more x entries than others (some participants have only 1 or 2 different x values while others have 7 or 8). There is some missing data as well.
I have approached the problem in a few ways with no luck. I am not sure what the best way is to handle this situation. The attempts I have tried either require a lot of code, usually the same basic code repeating multiple times, or code that doesn't work. Here is what I have tried.
I tried to use a for loop. I tried to create a new variable to identify the participant by id, identify the first time they did the survey, and then use the first x value. I would then repeat this for each time point (for time 2, find the second x value for a given participant; for time 3, find the third x value; and so on). As I currently have anywhere from 1 to 10 time points, this involves a lot of for loops. Because some people don't have a 6th or 7th time, the code often doesn't run. Here is an example of the for loop I have tried.
for (i in A1$id) {
temp.txt<- paste (
c ("A1$x1 [A$id ==", i," & A$time == 1] <- A1$x"
), collapse = "")
eval (parse(text = temp.txt))
}
I tried to subset the data for each time point, then merge the data together at the end. If I try this, I have missing data, and I also encounter issues with variables names no longer being accepted (I think because the names are similar, R has an issue with renaming everything). Here is an example of what that code looks like.
t1 <- subset (A1, A$time == 1)
t2 <- subset (A1, A$time == 2)
t3 <- subset (A1, A$time == 3)
t4 <- subset (A1, A$time == 4)
Z1 <- merge (t1, t2, by = "id")
Z2 <- merge (Z1, t3, by = "id")
Z3 <- merge (Z2, t4, by = "id")
Is there a different/easier way to approach this issue? Thanks, I really appreciate it.
1) reshape This is referred to as converting long form to wide form. In base R we can use reshape giving the following data frame. Note that reshape assumes that if there are columns named id and time then those are the id and time columns but if they had been named something else we would have had to specify them using the appropriate reshape arguments.
reshape(DF, dir = "wide")
## id x.1 x.2 x.3 x.4
## 1 1000 1 2 3 4
## 5 1001 1 2 3 4
2) xtabs Another base R solution is to use xtabs which gives the following table object:
xtabs(x ~ ., DF)
## time
## id 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
3) tapply Or use tapply, which gives this matrix:
with(DF, tapply(x, list(id, time), c))
## 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
4) pivot wider The tidyr package has pivot_wider to do this:
library(tidyr)
pivot_wider(DF, names_from = "time", values_from = x)
## # A tibble: 2 x 5
## id `1` `2` `3` `4`
## <int> <int> <int> <int> <int>
## 1 1000 1 2 3 4
## 2 1001 1 2 3 4
Note
The input in reproducible form:
Lines <- "id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4"
DF <- read.table(text = Lines, header = TRUE)
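The question mentions that some participants have fewer time points than others. A minimal sketch of how base reshape treats that case, using a small made-up data frame (DF2 and its values are hypothetical, not the asker's data) and spelling out the idvar and timevar arguments explicitly:

```r
# Hypothetical unbalanced input: id 1001 only has the first time point
DF2 <- data.frame(id   = c(1000, 1000, 1000, 1001),
                  time = c(1, 2, 3, 1),
                  x    = c(5, 6, 7, 8))

# Long to wide; time points that are missing for an id become NA
wide <- reshape(DF2, idvar = "id", timevar = "time", direction = "wide")
wide
```

Here the row for id 1001 should carry NA in x.2 and x.3, so no per-time-point loop or chain of merges is needed.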
Using data.table you can try
library(data.table)
setDT(A1) #Converting into data.table
result <- dcast(A1, id ~ time, value.var = "x") #long to wide conversion: one column per time point, filled with x
names(result)[-1]<- paste0("x.",names(result)[-1]) #setting the names accordingly
result #your result
id x.1 x.2 x.3 x.4
1: 1000 1 2 3 4
2: 1001 1 2 3 4

countif within R repeated across each row

I'm having trouble trying to replicate some of the countif function I'm familiar with in excel. I've got a data frame, and it has a large number of rows. I'm trying to take 2 variables (x & z) and do a countif of how many other variables within my dataframe match that. I figured out doing:
sum(mydataframe$x == mydataframe$x[1] & mydataframe$z == mydataframe$z[1])
This gives me the correct countif for x&z within the whole data set for the first row [1]. The problem is I've got to use that [1]. I've tried using the (with,...) command, but then I can no longer access the whole column.
I'd like to be able to do the count of x & z combination for each row within the data frame then have that output as a new vector that I can just add as another column. And I'd like this to go on for every row through to the end.
Hopefully this is pretty simple. I figure some combination of (with,..) or apply or something will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.
It seems that you are asking for a way to create a new column that contains the number of rows in the entire data frame with x and z value equal to the values of those variables for that row.
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
# x z n
# (dbl) (dbl) (int)
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
An option using data.table would be to convert the 'data.frame' to 'data.table' (setDT(dat)), group by 'x', 'z' and assign 'n' as the number of elements in each group (.N).
library(data.table)
setDT(dat)[, n:= .N, by = .(x,z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1
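For comparison, the question's single-row sum() can be repeated for every row with sapply. This is only a sketch to show that it yields the same counts as the grouped approaches; it compares every row against every other row, so it is likely to be much slower on a large data frame. The names n_loop and n_ave are illustrative.

```r
dat <- data.frame(x = c(1, 1, 2), z = c(3, 3, 3))

# Literal "countif" per row: count rows whose x and z match row i
n_loop <- sapply(seq_len(nrow(dat)), function(i) {
  sum(dat$x == dat$x[i] & dat$z == dat$z[i])
})

# Grouped count with ave(), as in the base R answer
n_ave <- ave(seq_len(nrow(dat)), dat$x, dat$z, FUN = length)

# Both give one count per row, in the original row order
dat$n <- n_ave
```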

keep values of a data frame column R

In my data frame df I want to get the id number satisfying the condition that the value of A is greater than the value of B. In the example I only would want Id=2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA<-vector();
vecB<-vector();
vecId<-vector();
i<-1
while(i<=length(dim(df)[1]){
if(df$Name[[i]]=="A"){vecA<-c(vecA,df$Value)}
if(df$Name[[i]]=="B"){vecB<-c(vecB,df$Value)}
if(vecA[i]>vecB[i]){vecId<-c(vecId,)}
i<-i+1
}
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
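If reshape2 is not available, a similar comparison can be sketched in base R by merging the A rows with the B rows on Id. The data frame below is rebuilt from the table in the question; the intermediate names a, b and ab are illustrative.

```r
df <- data.frame(Id    = c(1, 1, 1, 2, 2, 2),
                 Name  = c("A", "B", "C", "A", "B", "C"),
                 Value = c(3, 5, 4, 7, 6, 8))

# One row per Id for each of the two names being compared
a <- df[df$Name == "A", c("Id", "Value")]
b <- df[df$Name == "B", c("Id", "Value")]

# Side-by-side columns Value.A and Value.B
ab <- merge(a, b, by = "Id", suffixes = c(".A", ".B"))

# Ids where A is greater than B
ids <- ab$Id[ab$Value.A > ab$Value.B]
ids
```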
The first answer works out well for sure. I wanted to get to regular subset operations as well. I came up with this since you might want to check out some of the more recent R packages. If you had 3 groups to compare that would be interesting. Oh in the code below exp is the exact data.frame you started with.
library(plyr)
library(dplyr)
# Within each Id, keep only the larger of the A and B values
comp <- exp %>% filter(Name %in% c("A","B")) %>% group_by(Id) %>% filter(min_rank(Value)>1)
# If the whole row is needed (rows where A is the larger value)
comp[comp$Name == "A",]
# If not
comp$Id[comp$Name == "A"]

R - merge by variable column with duplicated entry

I am trying to merge two data sets of different sizes by ID. However, for the values that match, both data sets contain duplicated entries, i.e., there may be three ID #3 rows in Data A and three ID #3 rows in Data B. When I try to merge the data, the result is much larger than both data sets combined.
C<-merge(A,B,by="ID",all.x=T,sort=F)
I want to merge the two data by the ID column, such that the first ID #3 in B pairs with the first ID #3 in A, and so on.
Also, I want the row order of Data A to remain the same. Setting sort=FALSE didn't help much: it places all the matching rows at the top and the unmatched rows at the bottom.
Thanks for your help!
Before merging, you'll need to add to each data.frame a column whose value records the index of each observation within its own ID group.
## Example data
A <- data.frame(ID=c(1,1,1,2), ht=1:4)
B <- data.frame(ID=c(1,1,2,2), wt=3:6)
## Add column with number of each observation within ID
A <- transform(A, ID2=ave(ID, ID, FUN=seq_along))
B <- transform(B, ID2=ave(ID, ID, FUN=seq_along))
## Now carry out the merge
merge(A, B, all.x=TRUE, sort=FALSE)
# ID ID2 ht wt
# 1 1 1 1 3
# 2 1 2 2 4
# 3 2 1 4 5
# 4 1 3 3 NA
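To see why the within-ID index matters, the row counts of the two merges can be compared on the same example data. This is a small sketch; the names n_plain and paired are illustrative.

```r
A <- data.frame(ID = c(1, 1, 1, 2), ht = 1:4)
B <- data.frame(ID = c(1, 1, 2, 2), wt = 3:6)

# Merging on ID alone pairs every ID 1 row in A with every ID 1 row in B
n_plain <- nrow(merge(A, B, by = "ID", all.x = TRUE))

# Adding the occurrence number within each ID makes the key unique per row
A$ID2 <- ave(A$ID, A$ID, FUN = seq_along)
B$ID2 <- ave(B$ID, B$ID, FUN = seq_along)
paired <- merge(A, B, all.x = TRUE, sort = FALSE)

n_plain       # larger than nrow(A)
nrow(paired)  # same as nrow(A)
```

With the extra key column, each row of A matches at most one row of B, so the merged result has exactly as many rows as A.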
Thanks for your help, it is really useful. I end up adding a column of numbers to the larger data that I want to preserve order of.
Using @Josh O'Brien's example,
> ## Example data
> A <- data.frame(ID=c(1,1,1,2), ht=1:4)
> B <- data.frame(ID=c(1,1,2,2), wt=3:6)
>
> ## Add column with number of each observation within ID
> A <- transform(A, ID2=ave(ID, ID, FUN=seq_along))
> B <- transform(B, ID2=ave(ID, ID, FUN=seq_along))
>
> # Add a new column in A that numbers the row from 1 to number of row
> A$ORDER_DATA <- 1:nrow(A)
>
> ## Now carry out the merge
> C<-merge(A, B, all.x=TRUE, sort=FALSE)
>
> # Sort the merged data by ORDER_DATA column
> D<-C[with(C,order(ORDER_DATA)),]
> D
ID ID2 ht ORDER_DATA wt
1 1 1 1 1 3
2 1 2 2 2 4
4 1 3 3 3 NA
3 2 1 4 4 5

Multirow deletion: delete row depending on other row

I'm stuck with quite a complex problem. I have a data frame with three columns: id, info and row. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is to delete all other rows of an id if one of its rows contains the info a. This would mean, for example, that rows 2 and 3 should be removed because row 1's info column contains the value a. Please note that the info values are not ordered (id 3, rows 5 and 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all id containing an "a"-value
a_val <- data$id[grep("a", data$info)]
# check for every id containing an "a"-value
for(i in a_val) {
  temp_data <- data[which(data$id == i),]
  # only go on if the given id contains more than one row
  if (nrow(temp_data) > 1) {
    # loop over every row of this id, not just the last one
    for (ii in seq_len(nrow(temp_data))) {
      if (temp_data$info[ii] != "a") {
        temp <- temp_data$row[ii]
        if (!exists("delete_rows")) {
          delete_rows <- temp
        } else {
          delete_rows <- c(delete_rows, temp)
        }
      }
    }
  }
}
My solution works quite well. Nevertheless, it is very, very slow, as the original data contains more than 700k rows and more than 150k rows with an "a"-value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. @BenBarnes found out that this solution only works if the data frame is ordered according to id.
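A variant that should not depend on the rows being ordered by id can be sketched with ave(), which computes a per-id flag aligned with the original rows. The data frame is rebuilt from the question; has_a and result are illustrative names.

```r
data <- data.frame(id   = c(1, 1, 1, 2, 3, 3, 4, 4),
                   info = c("a", "b", "c", "a", "b", "a", "b", "c"),
                   row  = 1:8)

# For every row: does any row with the same id have info == "a"?
has_a <- ave(data$info == "a", data$id, FUN = any)

# Keep the "a" rows, plus all rows of ids that have no "a" at all
result <- data[data$info == "a" | !has_a, ]
result
```

Since the flag is computed per row rather than per sorted group, the same filter applies when the ids are interleaved.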
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
If df is this (re Sacha above), use match, which just finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
Take a look at subset; it's quite easy to use and it will solve your problem.
You just need to specify the value of the column that you want to subset on; alternatively, you can condition on more than one column.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html

Resources