R - How to put multiple cases (rows) in one row

I am doing some coding in R and have encountered an issue.
I have a data set where participants were given the same question(s) a few different times. There is one id variable, a time variable which records which instance we are dealing with, and one outcome variable.
I did a little research and found a post similar to what I am trying to do.
Turning one row into multiple rows in r
I am trying to do the exact opposite of what is being done in that post.
I created this small example to give an idea of what I am dealing with.
A1
id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4
What I need to do is reorganize the data set so that each case is on one line, with a separate x column for each time point (x1 would be the first time point, x2 would be the second time point, etc.). Here is a sample of what I would like the final data frame to look like.
B1
id x1 x2 x3 x4
1000 1 2 3 4
1001 1 2 3 4
There are a few nuances in my data that make this situation really tricky. Some participants have many more x entries than others (some participants have only 1 or 2 different x values while others have 7 or 8). There is some missing data as well.
I have approached the problem in a few ways with no luck. I am not sure what the best way is to handle this situation. The attempts I have tried either require a lot of code, usually the same basic code repeating multiple times, or code that doesn't work. Here is what I have tried.
I tried a for loop. I created a new variable to identify the participant by id, found the first time they took the survey, and used the first x value. I would then repeat this for each time point (for time 2, find the second x value for a given participant; for time 3, the third x value; etc.). As I currently have anywhere from 1 to 10 time points, this involves a lot of for loops, and because some people don't have a 6th or 7th time, the code often doesn't run. Here is an example of the for loop I have tried.
for (i in A1$id) {
  temp.txt <- paste(
    c("A1$x1[A1$id == ", i, " & A1$time == 1] <- A1$x"),
    collapse = "")
  eval(parse(text = temp.txt))
}
I tried to subset the data for each time point, then merge the subsets together at the end. If I try this, I have missing data, and I also run into issues with variable names no longer being accepted (I think because the names are similar, R has trouble renaming everything). Here is an example of what that code looks like.
t1 <- subset(A1, A1$time == 1)
t2 <- subset(A1, A1$time == 2)
t3 <- subset(A1, A1$time == 3)
t4 <- subset(A1, A1$time == 4)
Z1 <- merge(t1, t2, by = "id")
Z2 <- merge(Z1, t3, by = "id")
Z3 <- merge(Z2, t4, by = "id")
Is there a different/easier way to approach this issue? Thanks, I really appreciate it.

1) reshape This is referred to as converting long form to wide form. In base R we can use reshape, giving the following data frame. Note that reshape assumes that columns named id and time are the id and time columns; had they been named something else, we would have had to specify them using the idvar and timevar arguments.
reshape(DF, dir = "wide")
## id x.1 x.2 x.3 x.4
## 1 1000 1 2 3 4
## 5 1001 1 2 3 4
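For instance, with a hypothetical data frame DF2 whose columns were named subject, visit, and x, the id and time columns would be specified explicitly:
reshape(DF2, dir = "wide", idvar = "subject", timevar = "visit")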
2) xtabs Another base R solution is to use xtabs which gives the following table object:
xtabs(x ~ ., DF)
## time
## id 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
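If a plain data frame is preferred over the table object, the result can be coerced with base R's as.data.frame.matrix:
as.data.frame.matrix(xtabs(x ~ ., DF))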
3) tapply Another base R option is tapply, which gives this matrix:
with(DF, tapply(x, list(id, time), c))
## 1 2 3 4
## 1000 1 2 3 4
## 1001 1 2 3 4
4) pivot_wider The tidyr package has pivot_wider to do this:
library(tidyr)
pivot_wider(DF, names_from = "time", values_from = x)
## # A tibble: 2 x 5
## id `1` `2` `3` `4`
## <int> <int> <int> <int> <int>
## 1 1000 1 2 3 4
## 2 1001 1 2 3 4
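To get columns named x1 through x4, matching the desired B1 layout, pivot_wider's names_prefix argument can be added:
pivot_wider(DF, names_from = "time", values_from = x, names_prefix = "x")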
Note
The input in reproducible form:
Lines <- "id time x
1000 1 1
1000 2 2
1000 3 3
1000 4 4
1001 1 1
1001 2 2
1001 3 3
1001 4 4"
DF <- read.table(text = Lines, header = TRUE)

Using data.table you can try
library(data.table)
setDT(A1) # convert to data.table by reference
result <- dcast(A1, id ~ time, value.var = "x") # long to wide conversion
names(result)[-1] <- paste0("x.", names(result)[-1]) # prefix the new time columns
result # your result
id x.1 x.2 x.3 x.4
1: 1000 1 2 3 4
2: 1001 1 2 3 4
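Since participants in the question have unequal numbers of time points, note that dcast (like reshape and pivot_wider) fills the missing id/time combinations with NA. A minimal sketch using a hypothetical unbalanced table A2:
A2 <- data.table(id = c(1000, 1000, 1001), time = c(1, 2, 1), x = c(1, 2, 1))
dcast(A2, id ~ time, value.var = "x")
     id 1  2
1: 1000 1  2
2: 1001 1 NA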

Related

Subset cases in which there were more than 3 observations in longitudinal data?

I have a set of longitudinal data in which measures were repeatedly collected at various waves (see the example setup below). As this sort of data goes, however, there was attrition, with some participants dropping out before the study ended. My analysis assumes that each participant has at least 3 observations.
ID   Wave Score
1000    0     5
1000    1     4
1001    0     6
1001    1     6
1001    2     7
How would I subset only those IDs (subjects) that have at least 3 observations? I've looked into similar questions on stackoverflow but they do not seem to fit this specific issue.
Method 1
library(data.table)
# set as data.table
setDT(df)
# calculate no. of waves per ID
df[, no_of_waves := .N, ID]
# subset
df[no_of_waves >= 3]
Method 2
# highest wave reached per ID; waves start at 0, so at least 3
# observations corresponds to max(Wave) >= 2 (assuming no skipped waves)
df[, no_of_waves := max(Wave), ID]
# subset
df[no_of_waves >= 2]
Using base R, you could try this one-liner.
out <- with(df, df[ID %in% names(which(sapply(split(df, ID), nrow) > 2)), ])
Output
> out
ID Wave Score
3 1001 0 6
4 1001 1 6
5 1001 2 7
Data
df <- data.frame(
  ID = unlist(mapply(rep, 1000:1001, 2:3)),
  Wave = c(0, 1, 0, 1, 2),
  Score = c(5, 4, 6, 6, 7)
)
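For completeness, the same filter as a dplyr sketch (assuming the dplyr package is available; n() gives the number of rows in each group):
library(dplyr)
df %>% group_by(ID) %>% filter(n() >= 3) %>% ungroup()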

Take rows with a specific number of repeated values

In R, I have a large dataframe where the first two columns are the primary ID (object) and a secondary ID (element of the object).
I want to create a subset of this dataframe, with the condition that the primary and secondary ID pair has to appear in the former dataframe exactly 20 times. I also have to repeat this process for other dataframes with the same structure.
Right now, I first count how many times each couple of values (primary and secondary ID) repeats itself in a new dataframe and then use a for loop to create the new dataframe, but the process is extremely slow and inefficient: the loop writes 20 rows per second, starting from a dataframe that has from 500,000 to 1 million rows.
library(data.table) # fread/fwrite
library(plyr)       # ddply
for (i in 1:13) {
  x <- fread(dataframe_list[i]) # list which contains the dataframes that have to be analyzed
  x1 <- ddply(x, .(Primary_ID, Secondary_ID), nrow) # count how many times each couple of values repeats itself
  x2 <- subset(x1, x1$V1 == 20) # select all couples that are repeated 20 times
  for (n in 1:length(x2$Primary_ID)) {
    x3 <- subset(x, (x$Primary_ID == x2$Primary_ID[n]) & (x$Secondary_ID == x2$Secondary_ID[n]))
    outfiles <- paste0("B:/Results/Code_3_", Band[i], ".csv")
    fwrite(x3, file = outfiles, append = TRUE, sep = ",")
  }
}
How can I take, at once, all the rows from the former dataframe whose primary and secondary IDs match those obtained in the x2 dataframe, instead of writing one set of 20 rows at a time? Maybe this is easier in SQL, but I have to deal with R for now.
Edit:
Sure. Let's say I'm starting from a dataframe like this (there are other rows with repeating IDs; I'll stop at 5 rows for brevity):
Primary ID Secondary ID Variable
1 1 1 0.5729
2 1 2 0.6289
3 1 3 0.3123
4 2 1 0.4569
5 2 2 0.7319
Then with my code I count the repeated rows in a new dataframe (with a threshold of 4 instead of 20, to keep the example short):
Primary ID Secondary ID Count
1 1 1 1
2 1 2 3
3 1 3 4
4 2 1 2
5 2 2 4
The wanted output should be a dataframe like this:
Primary ID Secondary ID Variable
1 1 3 0.5920
2 1 3 0.6289
3 1 3 0.3123
4 1 3 0.4569
5 2 2 0.7319
6 2 2 0.5729
7 2 2 0.6289
8 2 2 0.3123
If anyone is interested, I managed to find a way. After counting with the code above how many times each couple of values is repeated, the output that I wanted can be obtained in this simple way:
library(dplyr) # for inner_join
# Select all the couples that are repeated 20 times
x2 <- subset(x1, x1$V1 == 20)
# Keep only the repeated primary and secondary IDs from x2
# (keeping the column names so the join can match on them)
x3 <- x2[, c("Primary_ID", "Secondary_ID")]
# Wanted output
dataframe <- inner_join(x, x3)
# Joining, by = c("Primary_ID", "Secondary_ID")
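The same result can also be obtained in data.table alone, without building the count table first (a sketch, assuming x is the full table as above): group by both IDs and keep a group only when its row count .N is exactly 20.
library(data.table)
setDT(x)[, if (.N == 20) .SD, by = .(Primary_ID, Secondary_ID)]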

countif within R repeated across each row

I'm having trouble trying to replicate some of the COUNTIF functionality I'm familiar with in Excel. I've got a data frame with a large number of rows. I'm trying to take 2 variables (x & z) and count how many rows within my data frame match that combination. I figured out doing:
sum(mydataframe$x == mydataframe$x[1] & mydataframe$z == mydataframe$z[1])
This gives me the correct count for the whole data set of rows matching the x and z values of the first row [1]. The problem is I've got to use that [1]. I've tried using with(), but then I can no longer access the whole column.
I'd like to be able to do the count of the x & z combination for each row within the data frame, then have that output as a new vector that I can just add as another column. And I'd like this to go on for every row through to the end.
Hopefully this is pretty simple. I figure some combination of with() or apply will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.
It seems that you are asking for a way to create a new column that contains the number of rows in the entire data frame with x and z value equal to the values of those variables for that row.
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
# x z n
# (dbl) (dbl) (int)
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
An option using data.table would be to convert the 'data.frame' to 'data.table' (setDT(dat)), group by 'x' and 'z', and assign 'n' as the number of elements in each group (.N).
library(data.table)
setDT(dat)[, n:= .N, by = .(x,z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1

Using table on a data frame by multiple variables

I have a data table that is in a "long" format, containing many entries for each unique ID. For example...
id <- c(1,1,1,2,2,2)
date <- c("A","A","B","C","C","C")
loc <- c("X", "X", "X", "X","Y","Z")
dfTest <- data.frame(id,date,loc)
Which creates a sample table.
id date loc
1 1 A X
2 1 A X
3 1 B X
4 2 C X
5 2 C Y
6 2 C Z
My goal is to create a table that looks like this.
id X Y Z
1 2 0 0
2 1 1 1
I would like to see how many times each location was visited uniquely. ID #1 visited X on day A and day B, giving a total of 2 unique visits. I approached this using reshape, thinking to turn this into a "wide" format. However, I don't know how to factor in the second variable (the date). I'm trying to pull out the number of visits to each location on unique dates. The actual date itself otherwise does not matter, just that it identifies the duplicate entries.
My current solution would be poor form in R (to use iterative loops to look at locations found within each unique date). I was hoping reshape, apply, aggregate, or perhaps another package may be of more help. I've looked through a bunch of other reshape guides, but am still a bit stuck on the clever way to do this.
By the sounds of it, you should be able to do what you need with the following: unique(dfTest) drops the duplicated rows, [-2] removes the date column, and table() counts the remaining id/loc combinations.
table(unique(dfTest)[-2])
## loc
## id X Y Z
## 1 2 0 0
## 2 1 1 1
We can group by 'loc' and 'id', get the number of unique elements of 'date' with uniqueN, and use dcast to get the expected output.
library(data.table)#v1.9.6+
dcast(setDT(dfTest)[, uniqueN(date), .(loc, id)], id~loc, value.var='V1', fill=0)
# id X Y Z
#1: 1 2 0 0
#2: 2 1 1 1
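For comparison, a tidyverse sketch of the same idea (assuming dplyr and tidyr are available): drop duplicate rows, count the id/loc pairs, then spread loc into columns, filling absent combinations with 0.
library(dplyr)
library(tidyr)
dfTest %>%
  distinct() %>%
  count(id, loc) %>%
  pivot_wider(names_from = loc, values_from = n, values_fill = 0)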

aggregate over several variables in r

I have a rather large dataset in long format where I need to count the number of instances of each ID arising from two different variables, A and B. E.g., the same person can be represented in multiple rows due to either A or B. What I need to do is count the number of instances of the ID, which is not too hard, but also count the number of instances due to A and B separately and return these as variables in the dataset.
The ddply() function from the plyr package lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occurs (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .var = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, the fastest and easiest solution is:
df$IDCount <- ave(df$ID, df$ID, FUN = length)
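Following the same pattern, the count within each ID/GRP combination can be added by passing both grouping variables to ave (a sketch using the GRP column from the df defined in the next answer):
df$GRPCount <- ave(df$ID, df$ID, df$GRP, FUN = length)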
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
