r dataframe using rank - r

I would like to rank each row of a dataframe (with 30 columns) whose numerical values range from -inf to +inf.
This is what I have:
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
> df
StockA StockB StockC
1 -5 2 -3
2 3 -1 -4
3 6 3 4
This is what I would like to have:
> df_rank
StockA StockB StockC
1 3 1 2
2 1 2 3
3 1 3 2
I am using this command:
> rank(df[1,])
StockA StockB StockC
2 3 1
The resulting rank variables are not correct though as you can see.

rank() assigns the lowest rank to the smallest value.
So the short answer to your question is to use rank of the vector multiplied by -1:
rank (-c(-5, 2, -3) )
[1] 3 1 2
Here is the full code:
# Data frame definition. As pointed out in the comments, the values should be
# numeric; otherwise the rank command will rank them as strings.
# So in the real world you should define them as numbers,
# but to go with your data I will convert them to integers in the next step.
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
# since you plan to rank them not as strings, but numbers, you need to convert
# them to integers:
df[] <- lapply(df,as.integer)
# apply will return a matrix or a list and you need to
# transpose the result and convert it back to a data.frame if needed
result <- as.data.frame(t( apply(df, 1, FUN=function(x){ return(rank(-x)) }) ))
result
# StockA StockB StockC
# 3 1 2
# 1 2 3
# 1 3 2
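As a minimal sketch of the same idea with numeric columns from the start: the helper name rank_desc and the choice of ties.method = "min" are assumptions added for illustration and are not part of the original answer. With "min", tied values share the best rank and the result stays integer.

```r
# Numeric version of the example data (no string conversion needed)
df <- data.frame(
  StockA = c(-5, 3, 6),
  StockB = c(2, -1, 3),
  StockC = c(-3, -4, 4)
)

# Hypothetical helper: rank in descending order, ties share the lowest rank
rank_desc <- function(x) rank(-x, ties.method = "min")

# apply() returns one column per input row, so transpose back
df_rank <- as.data.frame(t(apply(df, 1, rank_desc)))
df_rank
```

The default ties.method = "average" would produce fractional ranks such as 1.5 when two stocks in a row have the same value, which may or may not be what you want.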

Related

Merging 2 data sets with different number of rows, matched on a column, and creating NA values

I'm trying to accomplish something that allows me to merge two datasets with differing number of rows, match them on a common column and create NA values where there isn't matching data. For some reason, when I'm merging, the newly created data frame is auto filling values that should be NA and creating extra rows that I don't want. I'm trying to merge df_add (which has a total of 6 rows) into df_main (which has a total of 4 rows) and match the 2 on column "match_id" in df_main and "other_id" in df_add.
df_main <- data.frame (match_id = c("1", "1", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", "2006-09-13", "2006-09-13"),
type = c("Good", "Good", "Bad", "Bad")
)
df_add <- data.frame (other_id = c("1", "1", "1", "2", "2", "2"),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
This code is the closest I've gotten so far - it gives me the 6 rows that I want with the NA values but it doesn't match "match_id" and "other_id"
merge(df_main, df_add, by = 0, all = TRUE)[-1]
This is what I want my final merged data set to look like with only a total of 6 rows:
df_goal <- data.frame (match_id = c("1", "1", "1", "2", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", NA, "2006-09-13", "2006-09-13", NA),
type = c("Good", "Good", NA, "Bad", "Bad", NA),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
df_goal
Is there a way to accomplish this in r? Any help would be greatly appreciated!
This is really not a merge operation, mostly a cbind by-id.
ids <- unique(c(df_main$match_id, df_add$other_id))
ids
# [1] "1" "2"
mains <- split(df_main, df_main$match_id)
adds <- split(df_add, df_add$other_id)
do.call(rbind,
Map(function(x1, x2) {
nr <- max(nrow(x1), nrow(x2))
cbind(
rbind(x1, x1[0,][rep(NA, nr - nrow(x1)),]),
rbind(x2, x2[0,][rep(NA, nr - nrow(x2)),])
)
}, mains[ids], adds[ids])
)
# match_id index_date type other_id measure_date wt
# 1.1 1 2006-09-13 Good 1 2005-01-01 10
# 1.2 1 2006-09-13 Good 1 2005-03-13 11
# 1.NA <NA> <NA> <NA> 1 2005-04-19 15
# 2.3 2 2006-09-13 Bad 2 2005-06-22 60
# 2.4 2 2006-09-13 Bad 2 2005-09-29 42
# 2.NA <NA> <NA> <NA> 2 2005-11-03 33
The use of [ids] is solely to ensure that the _id variables are in the same order. This will run into problems if an id is in one and not the other, though if that's a possibility then it's possible to overcome that ...
Below is a solution with the package data.table. I have added the variable id_row to define a grouping order with the *_id columns. Then you merge on this as well through an outer join.
library(data.table)
df_main <- data.frame (match_id = c("1", "1", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", "2006-09-13", "2006-09-13"),
type = c("Good", "Good", "Bad", "Bad")
)
df_add <- data.frame (other_id = c("1", "1", "1", "2", "2", "2"),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
df_goal <- data.frame (match_id = c("1", "1", "1", "2", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", NA, "2006-09-13", "2006-09-13", NA),
type = c("Good", "Good", NA, "Bad", "Bad", NA),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
# convert to data.table
setDT(df_main)
setDT(df_add)
# define a row counter within each match_id (df_main) or other_id (df_add) group
df_main[ , id_row := 1L:.N, by = match_id]
df_add[ , id_row := 1L:.N, by = other_id]
# rename other_id to match_id
setnames(df_add, "other_id", "match_id")
# set joining keys
setkey(df_main, match_id, id_row)
setkey(df_add, match_id, id_row)
# do a right outer join: every row of df_add is kept, with NA where df_main has no match
out = df_main[ df_add ]
out
#> match_id index_date type id_row measure_date wt
#> 1: 1 2006-09-13 Good 1 2005-01-01 10
#> 2: 1 2006-09-13 Good 2 2005-03-13 11
#> 3: 1 <NA> <NA> 3 2005-04-19 15
#> 4: 2 2006-09-13 Bad 1 2005-06-22 60
#> 5: 2 2006-09-13 Bad 2 2005-09-29 42
#> 6: 2 <NA> <NA> 3 2005-11-03 33
Created on 2022-09-23 with reprex v2.0.2
You're missing a column to join by, we can create it and then slightly modify your code:
df_main$id2 <- ave(df_main$match_id, df_main$match_id, FUN = seq_along)
df_add$id2 <- ave(df_add$other_id, df_add$other_id, FUN = seq_along)
merge(df_main, df_add, by.x = c("match_id", "id2"), by.y = c("other_id", "id2"), all = TRUE)
#> match_id id2 index_date type measure_date wt
#> 1 1 1 2006-09-13 Good 2005-01-01 10
#> 2 1 2 2006-09-13 Good 2005-03-13 11
#> 3 1 3 <NA> <NA> 2005-04-19 15
#> 4 2 1 2006-09-13 Bad 2005-06-22 60
#> 5 2 2 2006-09-13 Bad 2005-09-29 42
#> 6 2 3 <NA> <NA> 2005-11-03 33
Created on 2022-09-27 by the reprex package (v2.0.1)
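The within-group counter approach also seems to cope with an id that appears in only one of the two data frames. The sketch below uses made-up data (an id "3" that exists only in df_add, and a column named seq) purely as an illustration; these are assumptions, not part of the question.

```r
# Illustrative data: id "2" is only in df_main, id "3" only in df_add
df_main <- data.frame(match_id = c("1", "1", "2", "2"),
                      type = c("Good", "Good", "Bad", "Bad"),
                      stringsAsFactors = FALSE)
df_add <- data.frame(other_id = c("1", "1", "1", "3"),
                     wt = c(10, 11, 15, 60),
                     stringsAsFactors = FALSE)

# Position of each row within its id group (1, 2, 3, ...)
df_main$seq <- ave(seq_along(df_main$match_id), df_main$match_id, FUN = seq_along)
df_add$seq <- ave(seq_along(df_add$other_id), df_add$other_id, FUN = seq_along)

# Full outer join on id plus within-group position
out <- merge(df_main, df_add,
             by.x = c("match_id", "seq"), by.y = c("other_id", "seq"),
             all = TRUE)
out
```

Using seq_along() as the first argument of ave() keeps the counter as an integer rather than a character value, which avoids surprises when sorting by it later.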

Count words in each cell of a dataframe in R

I have a dataframe that looks like
df <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("word1, word2", "word1", "word1"),
Variable2 = c("word1", "word1, word2", "word1"),
Variable3 = c("word1, word2", "word1", "word1, word2, word3")),
row.names = c(NA, -3L), class = "data.frame")
and would like to create a df that counts occurrences of words in each cell (separated by ",") and input the number into each cell.
df2 <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("2", "1", "1"),
Variable2 = c("1", "2", "1"),
Variable3 = c("2", "1", "3")),
row.names = c(NA, -3L), class = "data.frame")
Would someone be able to help me in how this would be done?
Thanks!
Using dplyr and stringi:
library(dplyr)
df %>%
mutate(across(matches("variable\\d{1,}"),stringi::stri_count_words))
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3
If you want a base-R solution, you could try this. Count the characters in each value with nchar, then subtract the number of characters left after removing the commas. The difference is the number of commas, and adding 1 gives the number of words/phrases separated by commas. This should be fast too (also see this answer).
cbind(df[1], t(apply(df[-1], 1, \(x) {
nchar(x) - nchar(gsub(",", "", x, fixed = TRUE)) + 1
})))
Output
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3
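Another base-R variant, sketched here as an alternative rather than as either answerer's method, splits each cell on commas and counts the pieces with lengths(). The helper name count_items is made up for this example.

```r
df <- data.frame(
  Variable  = c("Factor1", "Factor2", "Factor3"),
  Variable1 = c("word1, word2", "word1", "word1"),
  Variable2 = c("word1", "word1, word2", "word1"),
  Variable3 = c("word1, word2", "word1", "word1, word2, word3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of comma-separated items in each string
count_items <- function(x) lengths(strsplit(x, ",", fixed = TRUE))

df2 <- df
df2[-1] <- lapply(df[-1], count_items)  # replace every column except the first
df2
```

Unlike the comma-counting approach, this returns 0 rather than 1 for an empty string, which may matter if some cells are blank.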

combining rows based on a condition in R

I am trying to remove some useless rows from the df below. Each ID can have types 1:5, and the yes_no variable indicates whether a value was recorded. As you can see, I would like to remove the 3rd and 5th rows, because other rows with the same ID and type have a recorded value (yes_no = y).
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "3", "4", "4", "4", "5"), yes_no = c("n", "n", "n", "y", "n", "y", "y", "n"), value = c(NA, NA, NA, "2", NA, "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 n <NA>
1 3 y 2
1 4 n <NA>
1 4 y 5
1 4 y 6
1 5 n <NA>
The desired output is as follows:
df2 <- data.frame(ID = c("1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "4", "4", "5"), yes_no = c("n", "n", "y", "y", "y", "n"), value = c(NA, NA, "2", "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 y 2
1 4 y 5
1 4 y 6
1 5 n <NA>
There are IDs other than 1 that have types 1:5, so it looks like I have to group_by(ID). A dplyr solution would be great too.
Any help would be appreciated, thanks!
You may use an if condition to check if yes_no has any y value.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(if(any(yes_no == 'y')) yes_no == 'y' else TRUE) %>%
ungroup
# ID type yes_no value
# <chr> <chr> <chr> <chr>
#1 1 1 n NA
#2 1 2 n NA
#3 1 3 y 2
#4 1 4 y 5
#5 1 4 y 6
#6 1 5 n NA
A base R option using subset + ave
subset(
df,
ave(yes_no == "y", ID, type, FUN = max) == (yes_no == "y")
)
gives
ID type yes_no value
1 1 1 n <NA>
2 1 2 n <NA>
4 1 3 y 2
6 1 4 y 5
7 1 4 y 6
8 1 5 n <NA>
After grouping by 'ID' and 'type', we may use an OR (|) condition in filter to keep the rows where yes_no is 'y', or every row of a group in which no element is 'y'
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(yes_no == 'y'|all(yes_no != 'y')) %>%
ungroup
-output
# A tibble: 6 x 4
ID type yes_no value
<chr> <chr> <chr> <chr>
1 1 1 n <NA>
2 1 2 n <NA>
3 1 3 y 2
4 1 4 y 5
5 1 4 y 6
6 1 5 n <NA>
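The same keep rule can be written out step by step in base R. The sketch below uses a small made-up data frame with two IDs (an assumption for illustration) to show that grouping by both ID and type treats each ID separately.

```r
# Illustrative data: ID 1 has a recorded value for type 3, ID 2 does not
df <- data.frame(ID = c("1", "1", "2", "2"),
                 type = c("3", "3", "3", "3"),
                 yes_no = c("n", "y", "n", "n"),
                 value = c(NA, "2", NA, NA),
                 stringsAsFactors = FALSE)

# TRUE for every row whose (ID, type) group contains at least one "y"
has_y <- ave(df$yes_no == "y", df$ID, df$type, FUN = any)

# Keep the "y" rows, plus all rows of groups that have no "y" at all
keep <- df$yes_no == "y" | !has_y
df[keep, ]
```

Here the "n" row of ID 1 is dropped because that group has a "y" row, while both "n" rows of ID 2 are kept.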

Get the length of an element on nested list

I have a list (of lists) that came from JSON (jsonlite) like this one (dput below)
{
"1":["123", "131", "342"],
"2":["123", "131"],
"3":["123", "131", "352"],
"4":["31", "352"],
"5":["153", "131"],
"6":["153", "131", "382"]
}
structure(list(`1` = c("123", "131", "342"), `2` = c("123", "131" ), `3` = c("123", "131", "352"), `4` = c("31", "352"), `5` = c("153", "131"), `6` = c("153", "131", "382")), .Names = c("1", "2", "3", "4", "5", "6"))
Then, I'm trying to convert it to a data frame with the key and the length of the nested list, like
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 3 2 3 2 2 3
with that code:
a = (read_file("ghist.json") %>% fromJSON)$hist # Reads my list from a JSON file
dates = data.frame() #Creates an empty data frame
#Iterate my list element by element
for(i in 1:length(a)){
dates[1, i] = strtoi(names(a)[i]) #Appends to my data frame on the first row, line 'i' the key from my list (index 'i'), as Integer
dates[2, i] = length(a[i]) #Here is my problem, it returns '1', not the real length of my list (index 'i')
}
print(dates) #Just debug
With the code above I'm getting
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 1 1 1 1 1
Note: I know the keys are just increasing numbers here, but they will become dates in ms
You can just use the built-in lengths function to construct your data frame. This gives you the length of each list element, which is exactly what you want.
a <- structure(list(`1` = c("123", "131", "342"), `2` = c("123", "131"), `3` = c("123", "131", "352"), `4` = c("31", "352"), `5` = c("153", "131"), `6` = c("153", "131", "382")), .Names = c("1", "2", "3", "4", "5", "6"))
dates <- data.frame(
matrix(
data = c(names(a), lengths(a)),
ncol = length(a),
byrow = TRUE
)
)
dates
#> X1 X2 X3 X4 X5 X6
#> 1 1 2 3 4 5 6
#> 2 3 2 3 2 2 3
The bug in your code is very minor, though I wouldn't recommend this approach: you need length(a[[i]]). I suggest you look at some resources on subsetting in R, but to illustrate, compare the two calls at the bottom. a[1] returns a list containing the desired element, while a[[1]] returns the element itself. The length of a[1] is therefore 1.
dates = data.frame()
for(i in 1:length(a)){
dates[1, i] = strtoi(names(a)[i])
dates[2, i] = length(a[[i]]) # changed here
}
dates
#> V1 V2 V3 V4 V5 V6
#> 1 1 2 3 4 5 6
#> 2 3 2 3 2 2 3
a[1]
#> $`1`
#> [1] "123" "131" "342"
a[[1]]
#> [1] "123" "131" "342"
Created on 2018-06-26 by the reprex package (v0.2.0).
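If a long layout (one row per key) is acceptable, a sketch along these lines keeps the key numeric and the count integer instead of mixing both into one character matrix. The column names key and n are assumptions chosen for this example.

```r
a <- list(`1` = c("123", "131", "342"),
          `2` = c("123", "131"),
          `3` = c("123", "131", "352"))

# [ keeps the list wrapper, [[ extracts the element itself
length(a[1])    # length of a one-element list
length(a[[1]])  # length of the character vector inside it

# One row per key: numeric key (safe for millisecond timestamps) and its length
dates <- data.frame(key = as.numeric(names(a)),
                    n = unname(lengths(a)))
dates
```

as.numeric() is used for the key because millisecond timestamps would not fit into R's 32-bit integers.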
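Try length(a[[i]]) instead of length(a[i]) in your loop. Note that nrow(a[i]) would not help here: a[i] is a list, not a data frame, so nrow() returns NULL.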

Changing values between two data.frames in R

I have the following data.frames(sample):
>df1
number ACTION
1 1 this
2 2 that
3 3 theOther
4 4 another
>df2
id VALUE
1 1 3
2 2 4
3 3 2
4 4 1
4 5 4
4 6 2
4 7 3
. . .
. . .
I would like df2 to become like the following:
>df2
id VALUE
1 1 theOther
2 2 another
3 3 that
4 4 this
4 5 another
4 6 that
4 7 theOther
. . .
. . .
It can be done 'manually' by using the following for each value:
df2[df2==1] <- 'this'
df2[df2==2] <- 'that'
.
.
and so on, but is there a way to do it without doing it manually?
Try
df2$VALUE <- setNames(df1$ACTION, df1$number)[as.character(df2$VALUE)]
df2
# id VALUE
#1 1 theOther
#2 2 another
#3 3 that
#4 4 this
#5 5 another
#6 6 that
#7 7 theOther
Or use match
df2$VALUE <- df1$ACTION[match(df2$VALUE, df1$number)]
data
df1 <- structure(list(number = 1:4, ACTION = c("this", "that",
"theOther",
"another")), .Names = c("number", "ACTION"), class = "data.frame",
row.names = c("1", "2", "3", "4"))
df2 <- structure(list(id = 1:7, VALUE = c(3L, 4L, 2L, 1L, 4L, 2L, 3L
)), .Names = c("id", "VALUE"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
You could do:
library(qdapTools)
df2$VALUE <- lookup(terms = df2$VALUE, key.match = df1)
Note that for this to work, the columns in df1 need to be in the proper order (match key first, then the reassignment column). From ?lookup
key.match
Takes one of the following: (1) a two column data.frame of a match key
and reassignment column, (2) a named list of vectors (Note: if
data.frame or named list supplied no key reassign needed) or (3) a
single vector match key.
Which gives:
# id VALUE
#1 1 theOther
#2 2 another
#3 3 that
#4 4 this
#5 5 another
#6 6 that
#7 7 theOther
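With the match-based lookup, a code in df2 that has no entry in df1 comes back as NA. The sketch below adds a made-up code 9 to show this, and writes the labels to a new column (ACTION, an assumption for this example) so the original codes stay available.

```r
df1 <- data.frame(number = 1:4,
                  ACTION = c("this", "that", "theOther", "another"),
                  stringsAsFactors = FALSE)
# Illustrative data: code 9 has no matching row in df1
df2 <- data.frame(id = 1:5, VALUE = c(3L, 4L, 2L, 1L, 9L))

# Position of each VALUE in df1$number (NA when there is no match)
idx <- match(df2$VALUE, df1$number)

# Look up the label by position; unmatched codes become NA
df2$ACTION <- df1$ACTION[idx]
df2
```

Checking anyNA(idx) before overwriting a column is a cheap way to catch codes that are missing from the lookup table.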
