Long format separation issue - r

From this dataframe:
dftest <- data.frame(id = c(1), text = c("java-ee?jsf?omnifaces?jpa"), stringsAsFactors = F)
I would like to produce a dataframe like this:
data.frame(id = c(1), `java-ee` = c(1), jsf = c(1), omnifaces = c(1), jpa = c(1), check.names = FALSE)
I use these commands to make it:
s2 <- strsplit(dftest$text, split = "?")
dftest2 <- data.frame(id = rep(dftest$id, sapply(s2, length)), text = unlist(s2))
dflike_final <- reshape(dftest2, idvar = "id", timevar = "text", direction = "wide")
However, the result of the first two lines is this:
id text
1 1 j
2 1 a
3 1 v
4 1 a
5 1 -
6 1 e
7 1 e
8 1 ?
9 1 j
10 1 s
11 1 f
12 1 ?
13 1 o
14 1 m
15 1 n
16 1 i
17 1 f
18 1 a
19 1 c
20 1 e
21 1 s
22 1 ?
23 1 j
24 1 p
25 1 a
How can I fix it to have the whole string?

We can bring the text in separate rows, create a dummy column (n) and get the data in wide format using pivot_wider.
library(dplyr)
library(tidyr)
dftest %>%
separate_rows(text, sep = "\\?") %>%
mutate(n = 1) %>%
pivot_wider(values_from = n, names_from = text)
# A tibble: 1 x 5
# id `java-ee` jsf omnifaces jpa
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1
As mentioned by @Roland, ? is a special character in regex, so we need to escape it. You also need to include a dummy column when creating the new dataframe. You can then use your attempt as
s2 <- strsplit(dftest$text, split = "\\?")
dftest2 <- data.frame(id = rep(dftest$id, lengths(s2)), text = unlist(s2), n = 1)
dflike_final <- reshape(dftest2, idvar = "id", timevar = "text", direction = "wide")
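As an alternative to escaping, strsplit() accepts fixed = TRUE, which treats the separator as a literal string. The following is a minimal base R sketch using only the dftest example from the question; the names parts, long, tags, counts and wide are illustrative and not part of the original answers.

```r
# Example data from the question
dftest <- data.frame(id = 1, text = "java-ee?jsf?omnifaces?jpa",
                     stringsAsFactors = FALSE)

# fixed = TRUE treats "?" as a literal character, not as a regex quantifier
parts <- strsplit(dftest$text, split = "?", fixed = TRUE)

# Long format: one row per (id, tag) pair
long <- data.frame(id = rep(dftest$id, lengths(parts)), text = unlist(parts))

# Keep the tags in order of first appearance rather than alphabetical order
tags <- factor(long$text, levels = unique(long$text))

# Cross-tabulate ids against tags to get one indicator column per tag
counts <- table(long$id, tags)
wide <- cbind(id = as.numeric(rownames(counts)), as.data.frame.matrix(counts))
wide
```

Because the wide table is built from counts, a tag that appears twice for the same id would show 2 rather than 1; that may or may not be what you want.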

Related

Second Most Common Element in Each Row

I have this dataset in R:
library(stringr)
set.seed(999)
col1 = sample.int(5, 100, replace = TRUE)
col2 = sample.int(5, 100, replace = TRUE)
col3 = sample.int(5, 100, replace = TRUE)
col4 = sample.int(5, 100, replace = TRUE)
col5 = sample.int(5, 100, replace = TRUE)
col6 = sample.int(5, 100, replace = TRUE)
col7 = sample.int(5, 100, replace = TRUE)
col8 = sample.int(5, 100, replace = TRUE)
col9 = sample.int(5, 100, replace = TRUE)
col10 = sample.int(5, 100, replace = TRUE)
d = data.frame(id = 1:10, seq = c(paste(col1, collapse = ""), paste(col2, collapse = ""), paste(col3, collapse = ""), paste(col4, collapse = ""), paste(col5, collapse = ""), paste(col6, collapse = ""), paste(col7, collapse = ""), paste(col8, collapse = ""), paste(col9, collapse = ""), paste(col10, collapse = "")))
For each row, I would like to create new variables:
d$most_common: the most common element in each row
d$second_most_common: the second most common element in each row
d$third_most_common: the third most common element in each row
I tried to do this with the following function (Find the most frequent value by row):
rowMode <- function(x, ties = NULL, include.na = FALSE) {
# input checks data
if ( !(is.matrix(x) | is.data.frame(x)) ) {
stop("Your data is not a matrix or a data.frame.")
}
# input checks ties method
if ( !is.null(ties) && !(ties %in% c("random", "first", "last")) ) {
stop("Your ties method is not one of 'random', 'first' or 'last'.")
}
# set ties method to 'random' if not specified
if ( is.null(ties) ) ties <- "random"
# create row frequency table
rft <- table(c(row(x)), unlist(x), useNA = c("no","ifany")[1L + include.na])
# get the mode for each row
colnames(rft)[max.col(rft, ties.method = ties)]
}
rowMode(d[1,1])
This gave me an error:
Error in rowMode(d[1, 1]) : Your data is not a matrix or a data.frame.
Which is a bit confusing, seeing as "d" is a data.frame.
Is there an easier way to do this?
Thank you!
You can do this by splitting the long string into single characters, pivoting longer, counting instances by id and character, and taking the top 3.
Here is an approach using data.table
library(data.table)
setDT(d)
melt(d[, tstrsplit(seq,""), id], id.vars = "id")[, .N, .(id, value)][order(-N), .SD[1:3][,nth:=.I], id]
Output (first six rows of 30):
id value N nth
1: 2 2 30 1
2: 2 1 22 2
3: 2 4 19 3
4: 3 3 28 1
5: 3 2 23 2
6: 3 4 20 3
Here is a similar approach using dplyr with unnest() to make long:
d %>%
group_by(id) %>%
mutate(chars = strsplit(seq,"")) %>%
unnest(chars) %>%
count(id, chars,sort = T) %>%
slice_head(n=3)
Output:
id chars n
<int> <chr> <int>
1 1 1 24
2 1 5 20
3 1 2 19
4 2 2 30
5 2 1 22
6 2 4 19
7 3 3 28
8 3 2 23
9 3 4 20
10 4 1 26
If you need the variables most_common, second_most_common and third_most_common, you can use mutate() with str_count(), which counts how often each character occurs in a string:
library(dplyr)
library(stringr)
#range
r <- 1:5 |> as.character()
d |>
group_by(id) |>
mutate(most_common = which(unique(str_count(seq, r)) == last(sort(str_count(seq, r)))),
second_most_common = first(which(str_count(seq, r) == nth(sort(str_count(seq, r)), length(r) - 1))),
third_most_common = first(which(str_count(seq, r) == nth(sort(str_count(seq, r)), length(r) - 2))))
id seq most_common second_most_com… third_most_comm…
<int> <chr> <int> <int> <int>
1 1 3451122353321532415512241532113224441251251254542314534141431523132515542431525553… 1 5 2
2 2 1432431521432121553144243252433424314222143112242423421524144222151123234314255321… 2 1 4
3 3 4232245131422525453443332555312143535325221555344453323342533222344112134311342335… 3 2 4
4 4 4252525524252335331144111244343534224454131341553141342131354215143133213214314241… 1 3 4
5 5 2223513245222513345115334422121115412343225125312335414233115453235322543311352331… 3 2 1
6 6 3244331444151221411123513334135553324122122233134315145451545423111325253225325141… 1 1 2
7 7 4353332532552141211553131123521145214552211231144155553152131124221522333222343355… 5 1 3
8 8 1432215433134223221222143432454314232514255344213444342235252213324245413213554121… 2 4 3
9 9 2335142431432434123121254343455134511124323335211514354553145531115232541551252421… 1 1 3
10 10 1552245312213342315524134513123511112311314321112334533252141242212345432435421535… 1 3 2
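For comparison, the same ranking can be sketched in base R by tabulating the characters of each string and sorting the counts. This is only an illustration under stated assumptions: top_chars() and d_small are hypothetical names, the tiny data frame stands in for d, and ties are resolved in whatever order sort() leaves them.

```r
# Return the k most frequent characters of one string, most frequent first.
# If the string has fewer than k distinct characters, the rest are NA.
top_chars <- function(s, k = 3) {
  counts <- sort(table(strsplit(s, "", fixed = TRUE)[[1]]), decreasing = TRUE)
  names(counts)[seq_len(k)]
}

# Small stand-in for the data frame d from the question
d_small <- data.frame(id = 1:2, seq = c("111223", "5544455"),
                      stringsAsFactors = FALSE)

# One row of results per input string, one column per rank
ranks <- t(vapply(d_small$seq, top_chars, character(3), USE.NAMES = FALSE))
d_small$most_common        <- ranks[, 1]
d_small$second_most_common <- ranks[, 2]
d_small$third_most_common  <- ranks[, 3]
d_small
```

The function works on one string at a time, which is also why rowMode(d[1, 1]) fails in the question: d[1, 1] is a single value, not a matrix or data frame.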

Random Sample From a Dataframe With Specific Count

This question is probably best illustrated with an example.
Suppose I have a dataframe df with a binary variable b (values of b are 0 or 1). How can I take a random sample of size 10 from this dataframe so that the sample contains 2 instances where b=0 and 8 instances where b=1?
Right now, I know that I can do df[sample(nrow(df), 10), ] to get part of the answer, but that would give me a random number of 0 and 1 instances. How can I fix the number of 0 and 1 instances while still taking a random sample?
Here's an example of how I'd do this... take two samples and combine them. I've written a simple function so you can "just take one sample."
With a vector:
pop <- sample(c(0,1), 100, replace = TRUE)
yoursample <- function(pop, n_zero, n_one){
c(sample(pop[pop == 0], n_zero),
sample(pop[pop == 1], n_one))
}
yoursample(pop, n_zero = 2, n_one = 8)
[1] 0 0 1 1 1 1 1 1 1 1
Or, if you are working with a dataframe with some unique index called id:
# Where d1 is your data you are summarizing with mean and sd
dat <- data.frame(
id = 1:100,
val = sample(c(0,1), 100, replace = TRUE),
d1 = runif(100))
yoursample <- function(dat, n_zero, n_one){
c(sample(dat[dat$val == 0,"id"], n_zero),
sample(dat[dat$val == 1,"id"], n_one))
}
sample_ids <- yoursample(dat, n_zero = 2, n_one = 8)
sample_ids
mean(dat[dat$id %in% sample_ids,"d1"])
sd(dat[dat$id %in% sample_ids,"d1"])
Here is a suggestion:
First create example data of 0s and 1s with an id column.
Then draw 2 rows where the value is 0 and 8 rows where it is 1, and bind the two samples together:
library(tidyverse)
set.seed(123)
df <- as_tibble(sample(0:1,size=50,replace=TRUE)) %>%
mutate(id = row_number())
df1 <- df[ sample(which (df$value ==0) ,2), ]
df2 <- df[ sample(which (df$value ==1), 8), ]
df_final <- bind_rows(df1, df2)
value id
<int> <int>
1 0 14
2 0 36
3 1 21
4 1 24
5 1 2
6 1 50
7 1 49
8 1 41
9 1 28
10 1 33
library(tidyverse)
set.seed(123)
df <- data.frame(a = letters,
b = sample(c(0,1),26,T))
bind_rows(
df %>%
filter(b == 0) %>%
sample_n(2),
df %>%
filter(b == 1) %>%
sample_n(8)
) %>%
arrange(a)
a b
1 d 1
2 g 1
3 h 1
4 l 1
5 m 1
6 o 1
7 p 0
8 q 1
9 s 0
10 v 1
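The answers above all draw one sample per value and combine them. A minimal base R sketch that wraps this in a single function is shown here; sample_by_count(), its arguments and the example data are hypothetical names introduced for illustration.

```r
# Draw a fixed number of rows for each value of a column.
# counts is a named vector: names are the values, entries are the sample sizes.
sample_by_count <- function(df, col, counts) {
  idx <- unlist(lapply(names(counts), function(v) {
    rows <- which(df[[col]] == v)
    # Sample positions rather than values, so a single matching row
    # is not misread by sample() as the range 1:row
    rows[sample.int(length(rows), counts[[v]])]
  }))
  df[idx, , drop = FALSE]
}

set.seed(42)
df <- data.frame(id = 1:100, b = rep(c(0, 1), each = 50))
out <- sample_by_count(df, "b", c("0" = 2, "1" = 8))
table(out$b)
```

sample.int() raises an error if a group has fewer rows than requested, which is usually preferable to silently returning a smaller sample.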

Nested if statement and logic with strings in csv file in R

I have a csv file that looks like this:
I try to create an algorithm that goes like this:
Iterate through each row;
If the condition is Success,
if T1 == P1, increase score one point
if T2 == P2, increase score one point
if T3 == P3, increase score one point
Else if the condition is Failure,
elif T1 != P1, increase score one point
elif T2 != P2, increase score one point
elif T3 != P3, increase score one point
However, I got stuck on 2 things:
When I say something like:
for (i in 1:4){
if (data[i,7] == "Success")
.......
There is a syntax problem because of using string with logic. How to get it right?
It doesn't calculate correctly when I state something like: if(data[i,1] == data[i,4]) {score = score+1}, but it does calculate correctly if I use numbers instead of letters in the csv file. Again, how to use strings with logic operators?
The other problem is using nested if statements. How to do it so I can use the algorithm above?
Thank you for your time!
We may also do this with across(), i.e. loop over the columns whose names start with 'T'. Inside the loop, take the current column name (cur_column()), replace 'T' with 'P', and fetch that column's values with get(). The logical comparison is then turned into a position index by adding 1 (R indexing starts at 1), which picks -1 for a mismatch and 1 for a match from the vector c(-1, 1). Finally, rowSums() over the across() output creates the 'total_score' column.
library(dplyr)
library(stringr)
df %>%
mutate(total_score = rowSums(across(starts_with('T'),
~ c(-1, 1)[1 + (. == get(str_replace(cur_column(), 'T', 'P')))])))
-output
# A tibble: 4 x 5
T1 T2 P1 P2 total_score
<chr> <chr> <chr> <chr> <dbl>
1 a b a a 0
2 a a a a 2
3 a a a b 0
4 b a b b 0
data
df <- structure(list(T1 = c("a", "a", "a", "b"), T2 = c("b", "a", "a",
"a"), P1 = c("a", "a", "a", "b"), P2 = c("a", "a", "b", "b")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
A case_when() structure can be used for this. Since we don't know what your data looks like, I created dummy data to represent it:
library(dplyr)
set.seed(1453)
scores <- data.frame(T1=sample(1:5,size = 200,replace = T),
T2=sample(1:5,size = 200,replace = T),
T3=sample(1:5,size = 200,replace = T),
P1=sample(1:5,size = 200,replace = T),
P2=sample(1:5,size = 200,replace = T),
P3=sample(1:5,size = 200,replace = T),
score=sample(50:100,size = 200,replace = T))
scores2 <- scores %>%
mutate(new_score=case_when(T1==P1 ~ score + 1,
T2==P2 ~ score + 1,
T3==P3 ~ score + 1,
TRUE ~ score - 1))
scores2%>%
head
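Note: the final TRUE means "otherwise". case_when() uses only the first condition that matches, so each row changes by at most one point.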
output;
T1 T2 T3 P1 P2 P3 score new_score
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 4 2 2 1 5 5 64 63
2 3 5 4 2 1 3 82 81
3 5 1 5 4 5 5 89 90
4 2 5 3 4 5 1 62 63
5 3 5 4 3 2 4 53 54
6 3 1 4 1 3 2 82 81
If I got the problem right, each row is an observation, so I will compare each T column with the respective P column, then create a score for each comparison, and finally sum the scores for each row.
Libraries
library(tidyverse)
Example Data
df <-
tibble(
T1 = c("a","a","a","b"),
T2 = c("b","a","a","a"),
P1 = c("a","a","a","b"),
P2 = c("a","a","b","b")
)
Code
df %>%
mutate(
S1 = if_else(T1 == P1, 1,-1),
S2 = if_else(T2 == P2, 1,-1)
) %>%
rowwise() %>%
mutate(total_score = sum(c_across(starts_with("S"))))
Output
# A tibble: 4 x 7
# Rowwise:
T1 T2 P1 P2 S1 S2 total_score
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 a b a a 1 -1 0
2 a a a a 1 1 2
3 a a a b 1 -1 0
4 b a b b 1 -1 0
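None of the answers above uses the Success/Failure column from the question, so here is a hedged base R sketch of that part. The CSV itself is not shown, so the data frame dat, its column names and the condition column are assumptions based on the description (T columns first, P columns next, the condition in column 7). One possible cause of the string comparison problem is that older R versions (before 4.0.0) read text columns as factors by default, and comparing factors with different level sets raises an error; stringsAsFactors = FALSE avoids that.

```r
# Hypothetical stand-in for the CSV; text columns are kept as character
dat <- data.frame(
  T1 = c("a", "a", "a", "b"), T2 = c("b", "b", "a", "a"), T3 = c("c", "c", "a", "c"),
  P1 = c("a", "a", "a", "a"), P2 = c("b", "y", "a", "b"), P3 = c("x", "z", "a", "c"),
  condition = c("Success", "Failure", "Failure", "Success"),
  stringsAsFactors = FALSE
)

t_cols <- c("T1", "T2", "T3")
p_cols <- c("P1", "P2", "P3")

# Element-wise comparison of each T column with its P column
matches <- as.matrix(dat[t_cols]) == as.matrix(dat[p_cols])

# Success rows score one point per match, Failure rows one point per mismatch
dat$score <- ifelse(dat$condition == "Success",
                    rowSums(matches),
                    rowSums(!matches))
dat
```

This replaces both the loop and the nested if statements with one vectorised comparison, which is the usual way to express row-by-row rules in R.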

Adding column with information from another dataframe R

I have two dataframes and I need to join their information.
Here the first df where I have different points (1,2,3..):
eleno elety resno
1 N 1
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
The second one indicates distances between points, "eleno" represents the first point and "ele2" the second one:
eleno ele2 values
<chr> <chr> <dbl>
1 2 1.46
1 3 2.46
1 4 2.86
1 5 2.46
1 6 3.83
1 7 4.47
I'd like to have a new column in the 1st df with info from df 2. For example, for point 1 I'd like to have -2 (second point):1.46 (distance), -3:2.46, -4:2.86 and so on, preferably in one column.
Something like this
eleno elety resno dist
1 N 1 -2:1.46, -3:2.46, -4:2.86 ...
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
Thank you!
If I understand your preference to one column, then a possibility without dplyr is as follows. First, we create the new column by concatenating the ele2 and values columns from df2 using the paste() function, with a colon as the separator:
new_column <- paste(-df2$ele2, df2$values, sep = ":")
Then, we use cbind() to bind it to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
This will give us a new data frame like so:
eleno elety resno ele2_values
1 1 N 1 -2:1.46
2 2 CA 1 -3:2.46
3 3 C 1 -4:2.86
4 4 O 1 -5:2.46
5 5 CB 1 -6:3.83
6 6 CG 1 -7:4.47
Here is the data that I used, based on what you have given:
df1 <- data.frame(
eleno = 1:6,
elety = c("N", "CA", "C", "O", "CB", "CG"),
resno = rep(1, 6)
)
df2 <- data.frame(
eleno = rep(1, 6),
ele2 = 2:7,
values = c(1.46, 2.46, 2.86, 2.46, 3.83, 4.47)
)
If we want to get this column as a single element for each point, we can modify our code in the following manner:
Instantiate new_column as an empty vector:
new_column <- vector()
Then call some variant of *apply() or use a for loop to subset the original data frame by points, while applying our original code and appending our singular character elements back to new_column:
lapply(unique(df2$eleno), FUN = function(x) {
subset <- subset(df2, eleno == x)
new_elem <- paste(-subset$ele2, subset$values, sep = ":", collapse = ", ")
new_column <<- c(new_column, new_elem)
})
Once this operation is complete, we use cbind() as before to bind new_column to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
Our output is as follows,
eleno elety resno ele2_values
1 1 N 1 -2:1.13703411305323, -3:6.22299404814839, -4:6.09274732880294, -5:6.23379441676661, -6:8.60915383556858, -7:6.40310605289415
2 2 CA 1 -2:0.094957563560456, -3:2.32550506014377, -4:6.66083758231252, -5:5.14251141343266, -6:6.93591291783378, -7:5.44974835589528
3 3 C 1 -2:2.82733583590016, -3:9.23433484276757, -4:2.92315840255469, -5:8.37295628152788, -6:2.86223284667358, -7:2.66820780001581
4 4 O 1 -2:1.86722789658234, -3:2.32225910527632, -4:3.16612454829738, -5:3.02693370729685, -6:1.59046002896503, -7:0.399959180504084
5 5 CB 1 -2:2.18799541005865, -3:8.10598552459851, -4:5.25697546778247, -5:9.14658166002482, -6:8.3134504687041, -7:0.45770263299346
6 6 CG 1 -2:4.56091482425109, -3:2.65186671866104, -4:3.04672203026712, -5:5.0730687007308, -6:1.81096208281815, -7:7.59670635452494
Here is my random data that I used for df2 in this case:
set.seed(1234)
df2 <- data.frame(
eleno = rep(1:6, rep(6, 6)),
ele2 = 2:7,
values = runif(length(rep(1:6, rep(6, 6)))) * 10
)
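The cbind() step above assumes that every point in df1 has distances in df2 and that both are in the same order. A sketch that looks the strings up by eleno instead is given below; it is an illustration only, with small made-up data and the hypothetical names pairs and dist_by_point. It uses paste0("-", ...) rather than unary minus, so it also works when ele2 is stored as character, as in the question.

```r
# Small illustrative data: point 3 has no distances recorded
df1 <- data.frame(eleno = 1:3, elety = c("N", "CA", "C"), resno = 1)
df2 <- data.frame(eleno = c(1, 1, 2), ele2 = c(2, 3, 3),
                  values = c(1.46, 2.46, 1.53))

# Build one "-point:distance" string per row of df2
pairs <- paste0("-", df2$ele2, ":", df2$values)

# Collapse the strings belonging to the same first point
dist_by_point <- tapply(pairs, df2$eleno, paste, collapse = ", ")

# Look up each point of df1 by name; points without distances get NA
df1$dist <- as.vector(dist_by_point)[match(df1$eleno, names(dist_by_point))]
df1
```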

Dynamically select all columns but among ones that start with a certain word exclude all but keep one

I have many data frames that come in such a format:
df1 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2, FORMULA_UNITE = 1:2,
FORMULA_CALCULATE = 1:2, FORMULA_JOIN = 1:2), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
FORMULA_META = c(NA, NA), FORMULA_DATA = 1:2, FORMULA_JOIN = 1:2,
FORMULA_TRANSFORM = 1:2, Group = 1:2), class = "data.frame", row.names = c(NA,
-2L))
View:
df1
ID Name Gender Group FORMULA_RULE FORMULA_TRANSFORM FORMULA_UNITE FORMULA_CALCULATE FORMULA_JOIN
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
df2
ID Name Gender FORMULA_RULE FORMULA_META FORMULA_DATA FORMULA_JOIN FORMULA_TRANSFORM Group
1 1 1 1 1 NA 1 1 1 1
2 2 2 2 2 NA 2 2 2 2
I want to write code that works on all such dataframes so that all columns are kept, except that among the columns that start with FORMULA_, only FORMULA_TRANSFORM is selected. Please note that the columns that do NOT start with FORMULA_ are not always the same. That is to say, I cannot simply write code that always selects ID, Name, Gender, Group, and FORMULA_TRANSFORM, because some data frames contain many other columns that do not start with FORMULA_ and that I want to keep.
My attempt to solve this problem is this ugly code which works as expected:
library(tidyverse)
for(i in 1:length(ls(pattern = "df"))){
get(paste0("df", i)) %>%
select(-starts_with("FORMULA"),
(names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T))[!names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T) %in% "FORMULA_TRANSFORM"]) %>%
print()
}
Is there a more straight-forward way to do this?
With dplyr we can use select and it's pretty straight forward using starts_with and contains.
library(dplyr)
df1 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
Let's try with a dataframe without "FORMULA_TRANSFORM" column
df3 <- df1
df3$FORMULA_TRANSFORM <- NULL
df3 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
With minus sign we are removing the columns that starts_with "FORMULA_" and selecting the one with "FORMULA_TRANSFORM". Instead of contains we can also use one_of() or matches() and it would still work.
Using base R we can use grep with invert and value set as TRUE
df1[c(grep("^FORMULA_", names(df1), invert = TRUE, value = TRUE),
"FORMULA_TRANSFORM")]
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
This creates a vector of column names where column name doesn't start with "FORMULA_" and we add "FORMULA_TRANSFORM" manually later.
The above method assumes that you always have "FORMULA_TRANSFORM" column in your dataframe and it will fail if there isn't. Safer option would be
get_selected_cols <- function(df1) {
cbind(df1[grep("^FORMULA_", names(df1), invert = TRUE)],
df1[names(df1) == "FORMULA_TRANSFORM"])
}
get_selected_cols(df1)
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
get_selected_cols(df3)
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
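If the original column order should be preserved, a single logical index is enough. The sketch below is one possible way to do it, not the method from the answer above; keep_one_formula() and its arguments are hypothetical names, and the data frames are cut-down versions of df1 and df2 from the question.

```r
# Keep every column that does not match the prefix, plus the one named in keep
keep_one_formula <- function(df, keep = "FORMULA_TRANSFORM", prefix = "^FORMULA_") {
  is_formula <- grepl(prefix, names(df))
  df[!is_formula | names(df) == keep]
}

df1 <- data.frame(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
                  FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2, FORMULA_JOIN = 1:2)
df2 <- data.frame(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
                  FORMULA_TRANSFORM = 1:2, Group = 1:2)
df3 <- df1[names(df1) != "FORMULA_TRANSFORM"]

# Apply the same rule to every data frame in a list
cleaned <- lapply(list(df1 = df1, df2 = df2, df3 = df3), keep_one_formula)
lapply(cleaned, names)
```

Collecting the data frames in a list and using lapply() also avoids the get(paste0("df", i)) loop from the question.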
