If I want to sort a column by size in RStudio, how do I make sure that the associated values of the rows sort with the column? - r

I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values for one person. I need to sort one column in increasing order, but the remaining columns must be reordered along with it, so that each row still contains the data of one and the same person.
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
These are the column names of my data.frame, and I want to sort it by the column called "avg".

First of all, please always provide a reproducible example like the one below. Indexing a data frame with order() reorders whole rows, so every column is rearranged together with the column you sort by.
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
  fsskiddet fspiddet avg diff absdiff
1         1        1   1    1       1
2         2        2   2    2       2
3         3        3   3    3       3
# The minus sign sorts in decreasing order; use order(BAPlotDET$avg) for increasing order.
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg), ]
> BAPlotDET
  fsskiddet fspiddet avg diff absdiff
3         3        3   3    3       3
2         2        2   2    2       2
1         1        1   1    1       1
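Since the question asks for increasing order, here is a minimal sketch with distinct values in each row, which makes it easier to see that rows stay intact. The numbers are made up for illustration; only the column names come from the question.

```r
# Hypothetical values; only the column names are taken from the question.
fsskiddet <- c(30, 10, 20)
fspiddet  <- c(33, 12, 18)
BAPlotDET <- data.frame(
  fsskiddet = fsskiddet,
  fspiddet  = fspiddet,
  avg       = (fsskiddet + fspiddet) / 2,
  diff      = fsskiddet - fspiddet,
  absdiff   = abs(fsskiddet - fspiddet)
)

# order() returns the row positions that put avg in increasing order;
# using them as the row index moves every column together.
sorted <- BAPlotDET[order(BAPlotDET$avg), ]
print(sorted)
```

Each row of sorted still satisfies avg == (fsskiddet + fspiddet) / 2, which shows that the values of one person were kept together.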

Related

More efficient way to fuzzy match in R?

I am currently working on a data frame with 2 million rows (records). I want to identify potential duplicate records for follow-up. Someone else wrote a long script for me that works, but it currently takes all night to run.
It uses the stringdist package. From what I understand, the code compares one row against all other rows in the data frame. So, a data frame with 5 rows would require 20 computations:
i.e.
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 1
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 1
row 3 compared to row 2
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 1
row 4 compared to row 2
row 4 compared to row 3
row 4 compared to row 5
row 5 compared to row 1
row 5 compared to row 2
row 5 compared to row 3
row 5 compared to row 4
The number of comparisons grows quadratically with the size of the data frame (n rows need n × (n − 1) comparisons), so with my rather large data frame it obviously takes a while.
My proposed solution is this: after comparing each row with all of the other rows in the data frame, is there a way to omit those rows from future computations? For example, in the example above, row 1 compared to row 2 would be the same as row 2 compared to row 1. Could we remove one of these calculations?
So, using the example data frame above, the only computations should be:
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 5
This is the section in a function in the code that looks for these duplicates in various columns - any ideas on how I can amend this?
lastName <- stringdist(DataND$SURNAME[rownumber],DataND$SURNAME, method='lv')
firstName <- stringdist(DataND$GIVEN.NAME[rownumber],DataND$GIVEN.NAME, method='lv')
birthDate <- stringdist(DataND$DOB[rownumber],DataND$DOB, method='lv')
streetAddress<-stringdist(DataND$ADDR.1[rownumber],DataND$ADDR.1, method='lv')
suburb <- stringdist(DataND$LOCALITY[rownumber],DataND$LOCALITY, method='lv')
H 1's idea is great. Another option would be the fuzzyjoin package.
library(fuzzyjoin)
library(dplyr)
df <- tibble(id = seq(1,10),
words = replicate(10, paste(sample(LETTERS, 5), collapse = "")))
stringdist_left_join(df, df, by = c(words = "words"), max_dist = 5, method = "lv", distance_col = "distance") %>%
filter(distance != 0)
# A tibble: 90 x 5
id.x words.x id.y words.y distance
<int> <chr> <int> <chr> <dbl>
1 1 JUQYR 2 HQMFD 5
2 1 JUQYR 3 WHQOM 4
3 1 JUQYR 4 OUWJV 4
4 1 JUQYR 5 JURGD 3
5 1 JUQYR 6 ZMLAQ 5
6 1 JUQYR 7 RWLVU 5
7 1 JUQYR 8 AYNLE 5
8 1 JUQYR 9 AUPVJ 4
9 1 JUQYR 10 JDFEY 4
10 2 HQMFD 1 JUQYR 5
# ... with 80 more rows
The result has everything in one table, so you can keep or dismiss pairs of rows by distance. It took 11 seconds for 100,000 records. Trying with stringdistmatrix(), however, I got the error:
Error: cannot allocate vector of size 37.3 Gb
lastName<-stringdist(DataND$SURNAME[rownumber], DataND$SURNAME, method='lv')
If I understand this line correctly, it compares one surname (selected by the value of rownumber) with all surnames. So as rownumber changes, every comparison is made again, even the ones that were already done previously.
To prevent this, try:
lastName<-stringdist(DataND$SURNAME[rownumber], DataND$SURNAME[rownumber:nrows], method='lv')
where nrows is the number of rows, i.e. nrow(DataND). Note that the first element of the result compares the row with itself and is therefore always 0.
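To illustrate comparing each pair of rows only once, here is a minimal sketch on made-up surnames. It uses base R's adist() (generalized Levenshtein distance) instead of stringdist so that it runs without extra packages. The sample data, the threshold of 1, and the names pairs, result and candidates are assumptions for illustration. Enumerating all pairs with combn() is only practical for small data, not for 2 million rows.

```r
# Hypothetical sample data; only the column name SURNAME follows the question.
DataND <- data.frame(SURNAME = c("SMITH", "SMYTH", "JONES", "JONAS", "BROWN"),
                     stringsAsFactors = FALSE)
n <- nrow(DataND)

# All unordered pairs (i < j): n * (n - 1) / 2 of them instead of n * (n - 1).
pairs <- t(combn(n, 2))

# Levenshtein distance for each pair, computed one pair at a time.
distance <- mapply(
  function(i, j) adist(DataND$SURNAME[i], DataND$SURNAME[j])[1, 1],
  pairs[, 1], pairs[, 2]
)

result <- data.frame(row.x = pairs[, 1], row.y = pairs[, 2], distance = distance)

# Keep only close matches as duplicate candidates (threshold is an assumption).
candidates <- result[result$distance <= 1, ]
print(candidates)
```

For 5 rows this evaluates 10 pairs rather than 20, matching the reduced list in the question.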

Subtract 1 from the values of two columns

How do I subtract 1 from the values of two columns of a data.frame?
So far I couldn't find anything about how to address a column of a data.frame and subtract one from all of its values.
myData:
src target
1 1
2 2
3 3
4 4
Should become:
src target
0 0
1 1
2 2
3 3
If we need to subtract 1 from the last two columns, get the names of those columns with tail and subtract 1 from that subset:
nm1 <- tail(names(df1), 2)
df2 <- df1[nm1] - 1
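As a small sketch of the same idea applied to the data from the question, the columns can also be updated in place so that any other columns of the data frame are kept. The variable name cols is an assumption for illustration.

```r
# Data from the question
myData <- data.frame(src = 1:4, target = 1:4)

# Columns to change, selected by name
cols <- c("src", "target")

# Subtracting a scalar from a data frame subset is vectorised over all cells
myData[cols] <- myData[cols] - 1
print(myData)
```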

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This continues for multiple numbers. I want to take this dataframe and make a new dataframe that places two of the rows side by side, pairing them up without repeats (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4), so that each combination of years appears once within each type and number.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then remove rows that pair a Year with itself. I added an index to prevent redundant matches (1 with 2 and 2 with 1); keeping index.x < index.y puts the earlier row on the left, as in the example output.
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
out <- out[which(out$index.x<out$index.y & out$Year.x!=out$Year.y),]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in 1:(nrow(df)-1)){
  for (j in (i+1):nrow(df)){
    # && is the scalar "and", which is what if() expects
    if(df[i,"Year"]!=df[j,"Year"] && df[i,"Number"]==df[j,"Number"] && df[i,"Type"]==df[j,"Type"]){
      out2 <- rbind(out2,cbind(df[i,],df[j,]))
    }
  }
}
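A third possibility, sketched here as an alternative rather than as part of the answer above, is to split the data by Number and Type and pair rows within each group using combn(). The helper pair_rows and the ".2" suffix for the second set of columns are hypothetical names chosen for illustration, and the sketch assumes every group has at least two rows.

```r
df <- data.frame(Number = rep(1, 8),
                 Year   = rep(1:4, 2),
                 Type   = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))

# Hypothetical helper: pair every row of one group with every later row.
pair_rows <- function(g) {
  idx   <- combn(nrow(g), 2)      # each column holds one pair of row positions
  left  <- g[idx[1, ], ]
  right <- g[idx[2, ], ]
  names(right) <- paste0(names(right), ".2")  # avoid duplicate column names
  cbind(left, right)
}

# One group per Number/Type combination, then stack the paired rows.
groups <- split(df, list(df$Number, df$Type), drop = TRUE)
out3 <- do.call(rbind, lapply(groups, pair_rows))
rownames(out3) <- NULL
print(out3)
```

With four years per group this yields choose(4, 2) = 6 pairs per Type, so 12 rows for the sample data.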

sequential counting with input from more than one variable in r

I want to create a column of sequential values that is derived from two other columns in the df. The value should increase by one whenever Team changes (between 1 and 2) or Event = x. Any help would be appreciated! See example below:
   Team Event Value
1     1     a     1
2     1     a     1
3     2     a     2
4     2     x     3
5     2     a     3
6     1     a     4
7     1     x     5
8     1     a     5
9     2     x     6
10    2     a     6
This will do it...
df$Value <- cumsum(df$Event=="x" | c(1, diff(df$Team))!=0)
It takes the cumulative sum (i.e. of TRUE values) of those elements where either Event=="x" or the difference in successive values of Team is non-zero. An extra element is added at the start of the diff term to keep it the same length as the original.
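The one-liner can be broken into named steps and checked against the example from the question. The intermediate names is_x and team_changed are assumptions for illustration.

```r
# Team and Event columns from the question's example
df <- data.frame(
  Team  = c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
  Event = c("a", "a", "a", "x", "a", "a", "x", "a", "x", "a")
)

is_x <- df$Event == "x"

# diff() is one element shorter than its input, so the first row is
# marked as a change to start the counter at 1.
team_changed <- c(TRUE, diff(df$Team) != 0)

# Each TRUE adds 1 to the running total; a row where both conditions
# hold (row 9) is still counted only once.
df$Value <- cumsum(is_x | team_changed)
print(df)
```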

Sum of cells with same row and column name in R

I have a matrix created using the table() command in R, in which the row names and column names are not the same set of values.
  0 1 2
1 1 2 3
2 4 5 6
3 7 7 8
How can I sum the elements with the same row and column name? In this example it is equal to (2+6=)8.
Here's one approach:
# find the values present in both row names and column names
is <- do.call(intersect, unname(dimnames(x)))
# calculate the sum
sum(x[cbind(is, is)])
where x is your table.
Another one, self-explanatory:
sum(x[colnames(x)[col(x)] == rownames(x)[row(x)]])
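Both approaches can be tried on the example from the question. This sketch rebuilds the table as a plain matrix with dimnames, which is an assumption, since the original x came from table(). The names common, sum1 and sum2 are illustrative.

```r
# The example from the question as a matrix with row and column names
x <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 7, 8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("1", "2", "3"), c("0", "1", "2")))

# Approach 1: names present in both dimensions, used as (row, column) pairs
common <- intersect(rownames(x), colnames(x))
sum1 <- sum(x[cbind(common, common)])

# Approach 2: compare the row name and column name of every cell
sum2 <- sum(x[colnames(x)[col(x)] == rownames(x)[row(x)]])

print(c(sum1, sum2))
```

The shared names are "1" and "2", so both sums add x["1", "1"] and x["2", "2"].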
