Take rows with a specific number of repeated values - r

In R, I have a large dataframe where the first two columns are the primary ID (object) and a secondary ID (element of the object).
I want to create a subset of this dataframe, with the condition that the primary and secondary ID had to be repeated in former dataframe for 20 times. I have also to repeat this process for other dataframes with the same structure.
Right now, I'm first counting how many times each couple of values (primary and secondary IDs) repeats itself in a new dataframe and then using a for loop to create the new dataframe, but the process is extremely slow and inefficient: the loop writes 20 rows/second starting from a dataframe that has from 500.000 to 1 million of rows.
for (i in 1:13){
x <- fread(dataframe_list[i]) #list which contains the dataframes that have to be analyzed
x1 <- ddply(x,.(Primary_ID,Secondary_ID), nrow) #creating a dataframe which shows how many times a couple of values repeats itself
x2 <- subset(x1, x1$V1 == 20) #selecting all couples that are repeated for 20 times
for (n in 1:length(x2$Primary_ID)){
x3 <- subset(x, (x$Primary_ID == x2$Primary_ID[n]) & (x$Secondary_ID == x2$Secondary_ID[n]))
outfiles <- paste0("B:/Results/Code_3_", Band[i], ".csv")
fwrite(x3, file=outfiles, append = TRUE, sep = ",")
}
}
How to take, for example, all the rows from the former dataframe that have as values for the primary and secondary ID the ones obtained in the x2 dataframe at once instead of writing one set of 20 rows at a time? Maybe in SQL is easier but I have to deal with R for now.
Edit:
Sure. Let's say I'm starting from a dataframe like this (with other rows with repeating IDs, I'll just stop to 5 rows to be short):
Primary ID Secondary ID Variable
1 1 1 0.5729
2 1 2 0.6289
3 1 3 0.3123
4 2 1 0.4569
5 2 2 0.7319
Then with my code I count in a new dataframe the repeated rows (for a threshold value of 4 instead of 20, so I can give you a short example):
Primary ID Secondary ID Count
1 1 1 1
2 1 2 3
3 1 3 4
4 2 1 2
5 2 2 4
The wanted output should be a dataframe like this:
Primary ID Secondary ID Variable
1 1 3 0.5920
2 1 3 0.6289
3 1 3 0.3123
4 1 3 0.4569
5 2 2 0.7319
6 2 2 0.5729
7 2 2 0.6289
8 2 2 0.3123

If anyone is interested, I managed to find a way. After counting with the code above how many times the couple of values is repeated, the output that I wanted can be obtained in this simple way:
#Select all the couples that are repeated 20 times
x2 <- subset(x1, x1$V1 == 20)
#Create a dataframe which contains the repeated primary and secondary IDs from x2
x3 <- as.data.frame(cbind(x2$Primary_ID, x2$Secondary_ID)
#Wanted output
dataframe <- inner_join(x, x3)
#Joining, by c("Primary_ID", "Secondary_ID")

Related

if i want to sort a column by size in rstudio, how do i make sure that the associated values of the rows sort with the column?

I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values of one person. now i need to sort one column by size but I want the remaining columns to sort with the column, so that one column is sorted by increasing values and the other columns contain the values of the right persons. ( So that one row still contains data from one and the same person)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
these are the column names of my data.frame and I wanna sort it by the column called "avg"
First of all, please always provide us with a reproducible example such as below. The sorting of a data frame by default sorts all columns.
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
fsskiddet fspiddet avg diff absdiff
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg),]
> BAPlotDET
fsskiddet fspiddet avg diff absdiff
3 3 3 3 3 3
2 2 2 2 2 2
1 1 1 1 1 1

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This goes onto multiple for multiple numbers. I want to take this dataframe and make a new dataframe that has two of the rows together, but it would be nested (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4) where each combination of each year is together within types and numbers.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then remove rows that have different Year. I added an index to prevent redundant matches (1 with 2 and 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
out <- out[which(out$index.x>out$index.y & out$Year.x!=out$Year.y),]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in c(1:(nrow(df)-1))){
for (j in c((i+1):nrow(df))){
if(df[i,"Year"]!=df[j,"Year"] & df[i,"Number"]==df[j,"Number"] & df[i,"Type"]==df[j,"Type"]){
out2 <- rbind(out2,cbind(df[i,],df[j,]))
}
}
}

remove duplicate row based only of previous row

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicate and unique functions will remove all duplicates, leaving you only with unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much to large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)){
test <- as.vector(xy[i,] == xy[i-1,])
if (!(FALSE %in% test)){
toRemove <- c(toRemove, i) #build a vector of rows to remove
}
}
xy[-toRemove,] #exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns, when I try to run it over all 3 columns it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
Looks like we want to remove if the row is same as above:
# make an index, if cols not same as above
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
Why don't you just iterate the list while keeping track of the previous row to compare it to the next row?
If this is true at some point: remember that row position and remove it from the list then start iterating from the beginning of the list.
Don't delete row while iterating because you will get concurrent modification error.

working with data in tables in R

I'm a newbie at working with R. I've got some data with multiple observations (i.e., rows) per subject. Each subject has a unique identifier (ID) and has another variable of interest (X) which is constant across each observation. The number of observations per subject differs.
The data might look like this:
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I'd like to find some code that would:
a) Identify the number of observations per subject
b) Identify subjects with greater than a certain number of observations (e.g., >= 15 observations)
c) For subjects with greater than a certain number of observations, I'd like to to manipulate the X value for each observation (e.g., I might want to subtract 1 from their X value, so I'd like to modify X for each observation to be X-1)
I might want to identify subjects with at least three observations and reduce their X value by 1. In the above, individuals #1 and #3 (ID) have at least three observations, and their X values--which are constant across all observations--are 3 and 8, respectively. I want to find code that would identify individuals #1 and #3 and then let me recode all of their X values into a different variable. Maybe I just want to subtract 1 from each X value. In that case, the code would then give me X values of (3-1=)2 for #1 and 7 for #3, but #2 would remain at X = 4.
Any suggestions appreciated, thanks!
You can use the aggregate function to do this.
a) Say your table is named temp, you can find the total number of observations for each ID and x column by using the SUM function in aggregate:
tot =aggregate(Observation~ID+x, temp,FUN = sum)
The output will look like this:
ID x Observation
1 1 3 10
2 2 4 3
3 3 8 6
b) To see the IDs that are over a certain number, you can create a subset of the table, tot.
vals = tot$ID[tot$Observation>5]
Output is:
[1] 1 3
c) To change the values that were found in (b) you reference the subsetted data, where the number of observations is > 5, and then update those values.
tot$x[vals] = tot$x[vals]+1
The final output for the table is
ID x Observation
1 1 4 10
2 2 4 3
3 3 9 6
To change the original table, you can subset the table by the IDs you found
temp[temp$ID %in% vals,]$x = temp[temp$ID %in% vals,]$x + 1
a) Identify the number of observations per subject
you can use this code on each variable:
summary

How to convert summary output to a data frame?

I have summarized a column of a data frame (call this DATA) that consists of IDs so I get the total number of each ID in the given column. I'd like to convert this to another data frame (call this TOTALNUM), so I have two columns. The first column is the ID itself and the second column is the total number of each ID. Is this possible?
Sample data:
ids <- c(1,2,3,4,5,1,2,3,1,5,1,4,2,2,2)
info <- c("A","B","C","A","B","C","A","B","C","A","B","C","A","B","C")
DATA <- data.frame(ids, info)
DATA$ids <- as.factor(DATA$ids)
What I would like to put in a data frame:
Top row would be the first column in a new data frame.
Second row would be the second column in a new data frame.
summary(DATA$ids)
This is what I would like the data frame to look like:
ids nums
1 4
2 5
3 2
4 2
5 2
Thanks!!
With your approach, you can take advantage of the fact that summary returns a vector of counts, with names for each value of ids:
> my.summary <- summary(DATA$ids)
> data.frame(ids=names(my.summary), nums=my.summary)
ids nums
1 1 4
2 2 5
3 3 2
4 4 2
5 5 2
Or--and this approach is more straightforward--you can create a frequency table based on ids and then convert that to a data frame:
> as.data.frame(table(ids), responseName="nums")
ids nums
1 1 4
2 2 5
3 3 2
4 4 2
5 5 2

Resources