countif within R repeated across each row

I'm having trouble replicating the COUNTIF function I'm familiar with from Excel. I've got a data frame with a large number of rows. I'm trying to take two variables (x & z) and count how many other rows within my data frame match that combination. I figured out doing:
sum(mydataframe$x == mydataframe$x[1] & mydataframe$z == mydataframe$z[1])
This gives me the correct countif for the x & z combination of the first row [1] within the whole data set. The problem is that I have to hard-code that [1]. I've tried using with(), but then I can no longer access the whole column.
I'd like to be able to compute the count of the x & z combination for each row within the data frame, then have that output as a new vector that I can add as another column, going on for every row through to the end.
Hopefully this is pretty simple. I figure some combination of with(), apply, or something similar will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.

It seems that you are asking for a way to create a new column containing the number of rows in the entire data frame whose x and z values equal those of the current row.
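For reference, a literal per-row translation of that sum() call, i.e. the Excel COUNTIFS idea, can be written with sapply(). This is just a sketch using the sample data defined below; it scales quadratically with the number of rows, so the grouped approaches that follow are much faster on large data:
# Naive per-row count: for each row i, count rows matching its (x, z) pair.
dat$n <- sapply(seq_len(nrow(dat)), function(i)
  sum(dat$x == dat$x[i] & dat$z == dat$z[i]))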
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
#       x     z     n
#   (dbl) (dbl) (int)
# 1     1     3     2
# 2     1     3     2
# 3     2     3     1
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1

An option using data.table would be to convert the 'data.frame' to a 'data.table' (setDT(dat)), group by 'x' and 'z', and assign 'n' as the number of rows in each group (.N).
library(data.table)
setDT(dat)[, n := .N, by = .(x, z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1

Related

R- How to put multiple cases (rows) in one row

I am doing some coding in R and have encountered an issue.
I have a data set where participants were given the same question(s) a few different times. There is one id variable, a time variable which records which instance we are dealing with, and one outcome variable.
I did a little research and found a post similar to what I am trying to do.
Turning one row into multiple rows in r
I am trying to do the exact opposite of what is being done in that post.
I created this small code to give an idea of what I am dealing with.
A1
id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4
What I need to do is reorganize the data set so that each case is on one line and I repeat every X variable multiple times (x1 would be the first time point, x2 would be the second time point, etc). Here is a sample code of what I would like the final data frame to look like.
B1
id x1 x2 x3 x4
1000 1 2 3 4
1001 1 2 3 4
There are a few nuances in my data that make this situation really tricky. Some participants have many more x entries than others (some have only 1 or 2 different x values while others have 7 or 8). There is some missing data as well.
I have approached the problem in a few ways with no luck. I am not sure what the best way is to handle this situation. The attempts I have tried either require a lot of code, usually the same basic code repeating multiple times, or code that doesn't work. Here is what I have tried.
I tried to use a for loop. I tried to create a new variable to identify the participant by id and then identify the first time they took the survey, using the first x value. I would then repeat this for each time point (for time 2, find the second x value for a given participant; for time 3, find the third; etc.). As I currently have anywhere from 1 to 10 time points, this involves a lot of for loops. Because some people don't have a 6th or 7th time, the code often doesn't run. Here is an example of the for loop I have tried.
for (i in A1$id) {
  temp.txt <- paste(c("A1$x1[A1$id == ", i, " & A1$time == 1] <- A1$x"), collapse = "")
  eval(parse(text = temp.txt))
}
I tried to subset the data for each time point, then merge the data together at the end. If I try this, I have missing data, and I also encounter issues with variable names no longer being accepted (I think because the names are similar, R has an issue with renaming everything). Here is an example of what that code looks like.
t1 <- subset(A1, A1$time == 1)
t2 <- subset(A1, A1$time == 2)
t3 <- subset(A1, A1$time == 3)
t4 <- subset(A1, A1$time == 4)
Z1 <- merge(t1, t2, by = "id")
Z2 <- merge(Z1, t3, by = "id")
Z3 <- merge(Z2, t4, by = "id")
Is there a different/easier way to approach this issue? Thanks, I really appreciate it.
1) reshape This is referred to as converting long form to wide form. In base R we can use reshape, giving the following data frame. Note that reshape assumes that columns named id and time are the id and time columns; had they been named something else, we would have had to specify them using the appropriate reshape arguments (sketched after the output below).
reshape(DF, dir = "wide")
## id x.1 x.2 x.3 x.4
## 1 1000 1 2 3 4
## 5 1001 1 2 3 4
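As a sketch of those arguments: if the columns had instead been named, say, subj and wave (hypothetical names, for a data frame DF2 of the same shape), the explicit call would be along the lines of:
reshape(DF2, dir = "wide", idvar = "subj", timevar = "wave", v.names = "x")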
2) xtabs Another base R solution is to use xtabs which gives the following table object:
xtabs(x ~ ., DF)
## time
## id 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
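Since xtabs returns a table object rather than a data frame, one possible conversion back to a data frame (a sketch, not part of the original answer) is:
wide <- as.data.frame.matrix(xtabs(x ~ ., DF)) # one column per time point; row names carry the ids
wide$id <- as.integer(rownames(wide))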
3) tapply Another base R option is tapply, which gives this matrix:
with(DF, tapply(x, list(id, time), c))
## 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
4) pivot_wider The tidyr package has pivot_wider to do this:
library(tidyr)
pivot_wider(DF, names_from = "time", values_from = x)
## # A tibble: 2 x 5
##      id   `1`   `2`   `3`   `4`
##   <int> <int> <int> <int> <int>
## 1  1000     1     2     3     4
## 2  1001     1     2     3     4
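One nuance raised in the question is that participants have unequal numbers of time points. reshape and pivot_wider simply fill the missing combinations with NA (xtabs would show 0 instead). A quick sketch, adding a hypothetical participant 1002 with only two time points:
DF2 <- rbind(DF, data.frame(id = 1002, time = 1:2, x = c(9, 8)))
pivot_wider(DF2, names_from = "time", values_from = x)
## # A tibble: 3 x 5
##      id   `1`   `2`   `3`   `4`
## 1  1000     1     2     3     4
## 2  1001     1     2     3     4
## 3  1002     9     8    NA    NA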
Note
The input in reproducible form:
Lines <- "id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4"
DF <- read.table(text = Lines, header = TRUE)
Using data.table you can try
library(data.table)
setDT(A1)                                             # converting into a data.table in place
result <- dcast(A1, id ~ time, value.var = "x")       # long-to-wide conversion
names(result)[-1] <- paste0("x.", names(result)[-1])  # rename columns 1..4 to x.1..x.4
result #your result
id x.1 x.2 x.3 x.4
1: 1000 1 2 3 4
2: 1001 1 2 3 4

Aggregate in R taking way too long

I'm trying to count the unique values of x across groups y.
This is the function:
aggregate(x ~ y, z[which(z$grp == 0), ], function(x) length(unique(x)))
This is taking way too long (~6 hours and not done yet). I don't want to stop processing as I have to finish this tonight.
by() was taking too long as well
Any ideas what is going wrong and how I can reduce the processing time ~ 1 hour?
My dataset has 3 million rows and 16 columns.
Input dataframe z
x y grp
1 1 0
2 1 0
1 2 1
1 3 0
3 4 1
I want to get the count of unique x values for each y where grp = 0.
UPDATE: Using @eddi's excellent answer, I have
x y
1: 2 1
2: 1 3
Any idea how I can quickly summarize this as the number of x's for each value y?
So for this it will be
Number of x   y
          5   1
          1   3
Here you go:
library(data.table)
setDT(z) # to convert to data.table in place
z[grp == 0, uniqueN(x), by = y]
# y V1
#1: 1 2
#2: 3 1
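If a named column is preferred over the default V1, the count can be wrapped in .() (a small variation on the same call):
z[grp == 0, .(nx = uniqueN(x)), by = y]
# y nx
#1: 1 2
#2: 3 1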
library(dplyr)
z %>%
  filter(grp == 0) %>%
  group_by(y) %>%
  summarize(nx = n_distinct(x))
is the dplyr way, though it may not be as fast as data.table.

Using table on a data frame by multiple variables

I have a data table that is in a "long" format, containing many entries for each unique ID. For example...
id <- c(1,1,1,2,2,2)
date <- c("A","A","B","C","C","C")
loc <- c("X", "X", "X", "X","Y","Z")
dfTest <- data.frame(id,date,loc)
Which creates a sample table.
id date loc
1 1 A X
2 1 A X
3 1 B X
4 2 C X
5 2 C Y
6 2 C Z
My goal is to create a table that looks like this.
id X Y Z
1 2 0 0
2 1 1 1
I would like to see how many times a location was visited uniquely. ID#1 visited X on day A and day B, giving a total unique visits of 2. I approached this using reshape, thinking to turn this into a "wide" format. However, I don't know how to factor in the second variable (the date). I'm trying to pull out the number of visits to each location on unique dates. The actual date itself otherwise does not matter, just that it identify the duplicate entries.
My current solution would be poor form in R (to use iterative loops to look at locations found within each unique date). I was hoping reshape, apply, aggregate, or perhaps another package may be of more help. I've looked through a bunch of other reshape guides, but am still a bit stuck on the clever way to do this.
By the sounds of it, you should be able to do what you need with the one-liner below: unique(dfTest) drops fully duplicated rows (repeat visits on the same date), [-2] removes the date column, and table() then cross-tabulates id by loc.
table(unique(dfTest)[-2])
## loc
## id X Y Z
## 1 2 0 0
## 2 1 1 1
We can group by 'loc', 'id', get the length of unique elements of 'date' and use dcast to get the expected output.
library(data.table)#v1.9.6+
dcast(setDT(dfTest)[, uniqueN(date), .(loc, id)], id~loc, value.var='V1', fill=0)
# id X Y Z
#1: 1 2 0 0
#2: 2 1 1 1

keep values of a data frame column R

In my data frame df I want to get the Id numbers satisfying the condition that the value of A is greater than the value of B. In the example I would only want Id = 2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA <- vector()
vecB <- vector()
vecId <- vector()
i <- 1
while (i <= dim(df)[1]) {
  if (df$Name[[i]] == "A") { vecA <- c(vecA, df$Value[[i]]) }
  if (df$Name[[i]] == "B") { vecB <- c(vecB, df$Value[[i]]) }
  if (vecA[i] > vecB[i]) { vecId <- c(vecId, df$Id[[i]]) }
  i <- i + 1
}
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
The first answer certainly works. I wanted to show an approach based on regular subsetting operations as well, since you might want to check out some of the more recent R packages. If you had 3 groups to compare, that would get interesting. In the code below, exp is the exact data.frame you started with.
library(dplyr)
# Keep only the A and B rows, then within each Id keep the larger of the two
comp <- exp %>% filter(Name %in% c("A", "B")) %>% group_by(Id) %>% filter(min_rank(Value) > 1)
# If the whole rows are needed (the surviving A rows are the Ids with A > B)
comp[comp$Name == "A", ]
# If just the Ids
comp$Id[comp$Name == "A"]
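For comparison, a base R version of the same check is possible without reshaping, assuming every Id has exactly one A row and one B row and the Ids appear in the same order in both subsets (true of the example data; df is the question's data frame):
with(df, Id[Name == "A"][Value[Name == "A"] > Value[Name == "B"]])
# [1] 2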

aggregate over several variables in r

I have a rather large dataset in a long format where I need to count the number of instances of each ID arising from two different variables, A & B. E.g. the same person can be represented in multiple rows due to either A or B. What I need to do is count the number of instances of each ID, which is not too hard, but also count the number of instances due to A and due to B, and return these as variables in the dataset.
The ddply() function from the plyr package lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occurs (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .var = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$GRP, FUN = length)
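The same ave() pattern extends to the counts the question asks for, per ID and per ID within A/B, by passing additional grouping vectors. A sketch using the df defined in the next answer (column names chosen to match its output):
df$ID.FREQ <- ave(df$ID, df$ID, FUN = length)
df$GRP.FREQ <- ave(df$ID, df$ID, df$GRP, FUN = length)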
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
