I have a data table that is in a "long" format, containing many entries for each unique ID. For example...
id <- c(1,1,1,2,2,2)
date <- c("A","A","B","C","C","C")
loc <- c("X", "X", "X", "X","Y","Z")
dfTest <- data.frame(id,date,loc)
This creates the following sample table:
id date loc
1 1 A X
2 1 A X
3 1 B X
4 2 C X
5 2 C Y
6 2 C Z
My goal is to create a table that looks like this.
id X Y Z
1 2 0 0
2 1 1 1
I would like to see how many times a location was visited uniquely. ID #1 visited X on day A and day B, giving a total of 2 unique visits. I approached this using reshape, thinking to turn this into a "wide" format, but I don't know how to factor in the second variable (the date). I'm trying to pull out the number of visits to each location on unique dates. The actual date itself does not otherwise matter, just that it identifies the duplicate entries.
My current solution (iterative loops that check the locations found within each unique date) would be poor form in R. I was hoping reshape, apply, aggregate, or perhaps another package would be of more help. I've looked through a number of other reshape guides, but am still a bit stuck on the clever way to do this.
By the sounds of it, you should be able to do what you need with:
table(unique(dfTest)[-2])
## loc
## id X Y Z
## 1 2 0 0
## 2 1 1 1
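To unpack that one-liner with the sample data: unique() collapses duplicate rows so each remaining row is one distinct (id, date, loc) visit, [-2] drops the date column, and table() cross-tabulates what is left.
visits <- unique(dfTest)   # one row per distinct visit
visits
##   id date loc
## 1  1    A   X
## 3  1    B   X
## 4  2    C   X
## 5  2    C   Y
## 6  2    C   Z
table(visits[-2])          # cross-tabulate id against loc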
We can group by 'loc' and 'id', count the unique elements of 'date' with uniqueN, and use dcast to reshape to the expected output.
library(data.table) # v1.9.6+
dcast(setDT(dfTest)[, uniqueN(date), .(loc, id)], id ~ loc, value.var = "V1", fill = 0)
# id X Y Z
#1: 1 2 0 0
#2: 2 1 1 1
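For comparison, a tidyverse sketch of the same computation (my addition, not from the original answers; assumes current dplyr and tidyr):
library(dplyr)
library(tidyr)
dfTest %>%
  distinct(id, date, loc) %>%   # one row per unique visit
  count(id, loc) %>%            # number of distinct-date visits per id/location
  pivot_wider(names_from = loc, values_from = n, values_fill = 0)
## # A tibble: 2 x 4
##      id     X     Y     Z
## 1     1     2     0     0
## 2     2     1     1     1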
I am doing some coding in R and have encountered an issue.
I have a data set where participants were given the same question(s) a few different times. There is one id variable, a time variable which records which instance we are dealing with, and one outcome variable.
I did a little research and found a post similar to what I am trying to do.
Turning one row into multiple rows in r
I am trying to do the exact opposite of what is being done in that post.
I created this small example to give an idea of what I am dealing with.
A1
id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4
What I need to do is reorganize the data set so that each case is on one line, repeating the x variable across multiple columns (x1 would be the first time point, x2 would be the second time point, etc.). Here is a sample of what I would like the final data frame to look like.
B1
id x1 x2 x3 x4
1000 1 2 3 4
1001 1 2 3 4
There are a few nuances in my data that make this situation really tricky. Some participants have many more x entries than others (some have only 1 or 2 different x values while others have 7 or 8). There is some missing data as well.
I have approached the problem in a few ways with no luck, and I am not sure what the best way is to handle this situation. My attempts either require a lot of code (usually the same basic block repeated many times) or simply don't work. Here is what I have tried.
I tried to use a for loop. I tried to create a new variable to identify the participant by id and then identify the first time they took the survey, using the first x value. I would then repeat this for each time point (for time 2, find the second x value for a given participant; for time 3, the third x value; etc.). As I currently have anywhere from 1 to 10 time points, this requires a lot of for loops, and because some people don't have a 6th or 7th time point, the code often doesn't run. Here is an example of the for loop I have tried.
for (i in A1$id) {
  temp.txt <- paste(c("A1$x1[A1$id == ", i, " & A1$time == 1] <- A1$x"), collapse = "")
  eval(parse(text = temp.txt))
}
I tried to subset the data for each time point, then merge the data together at the end. If I try this, I have missing data, and I also run into issues with variable names no longer being accepted (I think because the names are similar, R has trouble renaming everything). Here is an example of what that code looks like.
t1 <- subset(A1, A1$time == 1)
t2 <- subset(A1, A1$time == 2)
t3 <- subset(A1, A1$time == 3)
t4 <- subset(A1, A1$time == 4)
Z1 <- merge(t1, t2, by = "id")
Z2 <- merge(Z1, t3, by = "id")
Z3 <- merge(Z2, t4, by = "id")
Is there a different/easier way to approach this issue? Thanks, I really appreciate it.
1) reshape This is referred to as converting long form to wide form. In base R we can use reshape, giving the following data frame. Note that reshape assumes that if there are columns named id and time then those are the id and time columns; had they been named something else, we would have had to specify them with the idvar and timevar arguments.
reshape(DF, dir = "wide")
## id x.1 x.2 x.3 x.4
## 1 1000 1 2 3 4
## 5 1001 1 2 3 4
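To sketch those arguments with hypothetical column names (person and wave are my invented names, not from the question):
DF2 <- data.frame(person = rep(c(1000, 1001), each = 4),
                  wave   = rep(1:4, 2),
                  x      = rep(1:4, 2))
# idvar and timevar tell reshape which columns play the id and time roles
reshape(DF2, dir = "wide", idvar = "person", timevar = "wave")
##   person x.1 x.2 x.3 x.4
## 1   1000   1   2   3   4
## 5   1001   1   2   3   4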
2) xtabs Another base R solution is to use xtabs which gives the following table object:
xtabs(x ~ ., DF)
## time
## id 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
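If a plain data frame is needed rather than a table object, one possible follow-up (my addition, not part of the original answer) is:
# a 2-dimensional table can be converted column-for-column
as.data.frame.matrix(xtabs(x ~ ., DF))
##      1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4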
3) tapply Or use tapply, which gives this matrix:
with(DF, tapply(x, list(id, time), c))
## 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
4) pivot_wider The tidyr package has pivot_wider to do this:
library(tidyr)
pivot_wider(DF, names_from = "time", values_from = "x")
## # A tibble: 2 x 5
## id `1` `2` `3` `4`
## <int> <int> <int> <int> <int>
## 1 1000 1 2 3 4
## 2 1001 1 2 3 4
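Since some participants in the question have fewer time points than others, it is worth noting (my addition) that pivot_wider simply fills missing combinations with NA. A sketch with hypothetical unbalanced input, using names_prefix to get x1, x2, ... directly:
DF2 <- data.frame(id   = c(1000, 1000, 1000, 1000, 1001, 1001),
                  time = c(1, 2, 3, 4, 1, 2),
                  x    = c(1, 2, 3, 4, 1, 2))
# participant 1001 stops after time 2; the x3 and x4 cells become NA
pivot_wider(DF2, names_from = "time", values_from = "x", names_prefix = "x")
## # A tibble: 2 x 5
##      id    x1    x2    x3    x4
## 1  1000     1     2     3     4
## 2  1001     1     2    NA    NA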
Note
The input in reproducible form:
Lines <- "id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4"
DF <- read.table(text = Lines, header = TRUE)
Using data.table you can try
library(data.table)
setDT(A1) # converting into a data.table by reference
result <- dcast(A1, id ~ time, value.var = "x") # long-to-wide conversion: one column per time, filled with x
names(result)[-1] <- paste0("x.", names(result)[-1]) # setting the names accordingly
result # your result
id x.1 x.2 x.3 x.4
1: 1000 1 2 3 4
2: 1001 1 2 3 4
In R, I have a large dataframe where the first two columns are the primary ID (object) and a secondary ID (element of the object).
I want to create a subset of this dataframe, with the condition that the primary/secondary ID pair has to be repeated in the original dataframe exactly 20 times. I also have to repeat this process for other dataframes with the same structure.
Right now, I'm first counting how many times each pair of values (primary and secondary ID) repeats itself in a new dataframe, and then using a for loop to build the new dataframe, but the process is extremely slow and inefficient: the loop writes 20 rows per second, starting from a dataframe that has from 500,000 to 1 million rows.
library(data.table)  # for fread/fwrite
library(plyr)        # for ddply

for (i in 1:13) {
  x <- fread(dataframe_list[i])  # list containing the dataframes to be analyzed
  x1 <- ddply(x, .(Primary_ID, Secondary_ID), nrow)  # count how many times each pair of values repeats
  x2 <- subset(x1, x1$V1 == 20)  # select all pairs that are repeated 20 times
  for (n in 1:length(x2$Primary_ID)) {
    x3 <- subset(x, (x$Primary_ID == x2$Primary_ID[n]) & (x$Secondary_ID == x2$Secondary_ID[n]))
    outfiles <- paste0("B:/Results/Code_3_", Band[i], ".csv")
    fwrite(x3, file = outfiles, append = TRUE, sep = ",")
  }
}
How can I take, at once, all the rows from the original dataframe whose primary and secondary IDs match those obtained in the x2 dataframe, instead of writing one set of 20 rows at a time? Maybe this would be easier in SQL, but I have to deal with R for now.
Edit:
Sure. Let's say I'm starting from a dataframe like this (there are more rows with repeating IDs; I'll stop at 5 rows to keep it short):
Primary ID Secondary ID Variable
1 1 1 0.5729
2 1 2 0.6289
3 1 3 0.3123
4 2 1 0.4569
5 2 2 0.7319
Then with my code I count the repeated rows in a new dataframe (using a threshold of 4 instead of 20 to keep the example short):
Primary ID Secondary ID Count
1 1 1 1
2 1 2 3
3 1 3 4
4 2 1 2
5 2 2 4
The wanted output should be a dataframe like this:
Primary ID Secondary ID Variable
1 1 3 0.5920
2 1 3 0.6289
3 1 3 0.3123
4 1 3 0.4569
5 2 2 0.7319
6 2 2 0.5729
7 2 2 0.6289
8 2 2 0.3123
If anyone is interested, I managed to find a way. After counting with the code above how many times each pair of values is repeated, the output I wanted can be obtained in this simple way:
#Select all the pairs that are repeated exactly 20 times
x2 <- subset(x1, x1$V1 == 20)
#Create a dataframe containing the repeated primary and secondary IDs from x2
#(named columns so the join below can match on them)
x3 <- data.frame(Primary_ID = x2$Primary_ID, Secondary_ID = x2$Secondary_ID)
#Wanted output (inner_join is from dplyr)
library(dplyr)
dataframe <- inner_join(x, x3)
#Joining, by = c("Primary_ID", "Secondary_ID")
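For comparison, a data.table sketch that filters in one step, without building the intermediate count table (my addition; assumes x is the data.table read in with fread):
library(data.table)
# .N is the number of rows in each (Primary_ID, Secondary_ID) group;
# groups with exactly 20 rows are kept whole, everything else is dropped
result <- x[, if (.N == 20) .SD, by = .(Primary_ID, Secondary_ID)]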
I have a dataset "a" with a column "id" of about 23,000 rows, whose values are unique within this dataframe. I want to count how often these unique values appear in two other datasets, "b" and "c".
To do this, I tried the code:
count1 <- as.data.frame(apply(a, 1, function(x) sum(b$id == x[45])))
a <- cbind(a, count1)
Here [45] is used since "id" is the 45th column in the dataframe "b".
The code works for counting in b, but when I tried the same code for counting the frequency of "id" in dataframe "c":
count2 <- as.data.frame(apply(a,1,function(x)sum(c$id==x[17])))
"id" in dataframe "c" is in the 17th column. The frequencies of all "id"s are counted as 0, which is not the case it should be. Could anyone suggest where the problem is or how to fix this problem?
We can actually do this in a way that might at first seem a little weird, but is relatively straightforward. Let's start by working with just data frames a and b, and let's simplify things a bit. Let's assume that the id variables in a and b are the following:
a_id <- 1:5
b_id <- 1:5
In this simple example, a_id and b_id are exactly identical. What we want to know is how many times each of the values in a_id shows up in b_id. We obviously know the answer is one time each, but how do we get R to tell us that? That's where the table function can come in handy:
table(a_id, b_id)
# b_id
# a_id 1 2 3 4 5
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 1 0
# 5 0 0 0 0 1
That might look a little ugly, but you can see that we have our b_ids on the top (1-5) and our a_ids on the left-hand side. Down the diagonal, we see the counts for how many times each value of a_id shows up in b_id, and it's 1 each just like we already knew. So how do we get just that information? R has a nice function called diag that gets the main diagonal for us:
diag(table(a_id, b_id))
# 1 2 3 4 5
# 1 1 1 1 1
And there we have it. A vector with our "countif" values. But what if b_id doesn't have all of the values that are in a_id? If we try to do what we just did, we'll get an error because table doesn't like it when two vectors have different lengths. So we modify it a bit:
a_id <- 1:10
b_id <- 4:8
table(b_id[b_id %in% a_id])
# 4 5 6 7 8
# 1 1 1 1 1
A couple of new things here. The use of %in% just asks R to tell us whether a value exists in a vector. For example, 1 %in% 1:3 returns TRUE, but 4 %in% 1:3 returns FALSE. Next, you'll notice that we indexed b_id using [. This only returns the values of b_id for which b_id %in% a_id is TRUE, which in this case is all of b_id.
So what does this look like if we expect more than one occurrence of each a_id value in b_id, but not every a_id value to be in b_id? Let's look at a more realistic example:
a_id <- 1:10
b_id <- sample(3:7, 1000, replace=TRUE)
table(b_id[b_id %in% a_id])
# 3 4 5 6 7
# 210 182 216 177 215
Like I said, it might seem a little weird at first, but it's relatively straightforward. Hopefully this helps you more than it confuses you.
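One refinement worth knowing (my addition): if you want a count for every value of a_id, including the ones that never appear in b_id, pin the levels with factor(); absent values then show up as zeros instead of being dropped:
# continuing with a_id and b_id from the example above: factor() fixes
# the set of levels to a_id, so table() reports a (possibly zero) count
# for every a_id value
table(factor(b_id, levels = a_id))
##   1   2   3   4   5   6   7   8   9  10
##   0   0 210 182 216 177 215   0   0   0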
I'm having trouble trying to replicate some of the countif functionality I'm familiar with from Excel. I've got a data frame with a large number of rows. I'm trying to take 2 variables (x & z) and count how many other rows within my dataframe match that combination. I figured out doing:
sum(mydataframe$x == mydataframe$x[1] & mydataframe$z == mydataframe$z[1])
This gives me the correct countif for the x & z combination of the first row [1] within the whole data set. The problem is that I've got to use that [1] index. I've tried using with(), but then I can no longer access the whole column.
I'd like to compute the count of each row's x & z combination within the data frame and output that as a new vector, which I can then add as another column, going through every row to the end.
Hopefully this is pretty simple. I figure some combination of with(), apply, or something similar will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.
It seems that you are asking for a way to create a new column containing, for each row, the number of rows in the entire data frame whose x and z values equal those of that row.
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
# x z n
# (dbl) (dbl) (int)
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
An option using data.table would be to convert the 'data.frame' to 'data.table' (setDT(dat)), group by 'x' and 'z', and assign 'n' as the number of elements in each group (.N).
library(data.table)
setDT(dat)[, n := .N, by = .(x, z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1
I have a rather large dataset in a long format where I need to count the number of instances of each ID due to two different variables, A & B. E.g., the same person can be represented in multiple rows due to either A or B. What I need to do is count the number of instances of each ID, which is not too hard, but also count the number of instances due to A and B, and return these as variables in the dataset.
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occurs (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, the fastest and easiest solution is:
df$IDCount <- ave(df$ID, df$GRP, FUN = length)
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
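For completeness, a dplyr sketch producing the same two count columns in one pipeline (my addition, not one of the original answers):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(ID.FREQ = n()) %>%       # rows per ID
  group_by(ID, GRP) %>%
  mutate(GRP.FREQ = n()) %>%      # rows per ID/GRP combination
  ungroup()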