I have a csv file that looks like this:
I am trying to write an algorithm that goes like this:
Iterate through each row;
If the condition is Success,
if T1 == P1, increase score one point
if T2 == P2, increase score one point
if T3 == P3, increase score one point
Else if the condition is Failure,
if T1 != P1, increase score one point
if T2 != P2, increase score one point
if T3 != P3, increase score one point
However, I got stuck on 2 things:
When I say something like:
for (i in 1:4){
if (data[i,7] == "Success")
.......
This gives a syntax problem, apparently because I am using a string in a logical test. How do I get it right?
It doesn't calculate correctly when I state something like: if(data[i,1] == data[i,4]) {score = score+1}, but it does calculate correctly if I use numbers instead of letters in the csv file. Again, how do I use strings with logical operators?
The other problem is nesting if statements. How do I nest them so that I can implement the algorithm above?
Thank you for your time!
We can also do this with across(), i.e. loop over the columns whose names start with 'T'. Inside the loop, take the current column name (cur_column()), replace the substring 'T' with 'P', and fetch that column's values with get(). The logical comparison of the two columns is then converted to a position index by adding 1 (R indexing starts at 1), which picks -1 or 1 from the vector c(-1, 1). Finally, rowSums() on the across() output creates the 'total_score' column.
library(dplyr)
library(stringr)
df %>%
mutate(total_score = rowSums(across(starts_with('T'),
~ c(-1, 1)[1 + (. == get(str_replace(cur_column(), 'T', 'P')))])))
-output
# A tibble: 4 x 5
T1 T2 P1 P2 total_score
<chr> <chr> <chr> <chr> <dbl>
1 a b a a 0
2 a a a a 2
3 a a a b 0
4 b a b b 0
data
df <- structure(list(T1 = c("a", "a", "a", "b"), T2 = c("b", "a", "a",
"a"), P1 = c("a", "a", "a", "b"), P2 = c("a", "a", "b", "b")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
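The indexing step in the code above may be easier to see in isolation. Here is a minimal sketch with a made-up logical vector (`matches` and `points` are hypothetical names, not part of the answer): adding 1 to a logical gives 1 for FALSE and 2 for TRUE, and that position then selects -1 or 1 from c(-1, 1).

```r
# Hypothetical comparison result: TRUE where a T column equals its P column
matches <- c(TRUE, FALSE, TRUE)

# 1 + FALSE is 1 and 1 + TRUE is 2, so this picks -1 or 1 by position
points <- c(-1, 1)[1 + matches]
points
```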
A case_when() structure can do what you want. Since we don't know what your data look like, I created dummy data to stand in for yours:
library(dplyr)
set.seed(1453)
scores <- data.frame(T1=sample(1:5,size = 200,replace = T),
T2=sample(1:5,size = 200,replace = T),
T3=sample(1:5,size = 200,replace = T),
P1=sample(1:5,size = 200,replace = T),
P2=sample(1:5,size = 200,replace = T),
P3=sample(1:5,size = 200,replace = T),
score=sample(50:100,size = 200,replace = T))
scores2 <- scores %>%
mutate(new_score=case_when(T1==P1 ~ score + 1,
T2==P2 ~ score + 1,
T3==P3 ~ score + 1,
TRUE ~ score - 1))
scores2%>%
head
Note: TRUE means "otherwise". case_when() uses only the first condition that matches, so each row gains at most one point here.
output;
T1 T2 T3 P1 P2 P3 score new_score
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 4 2 2 1 5 5 64 63
2 3 5 4 2 1 3 82 81
3 5 1 5 4 5 5 89 90
4 2 5 3 4 5 1 62 63
5 3 5 4 3 2 4 53 54
6 3 1 4 1 3 2 82 81
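If every matching pair should add its own point, as the question describes, one possible variation is to add the logical comparisons directly, since TRUE counts as 1 and FALSE as 0 in arithmetic. This is only a base R sketch with made-up data (`toy` and its values are assumptions, not the question's data):

```r
# Made-up stand-in data: two T/P pairs and a starting score
toy <- data.frame(
  T1 = c(1, 2), T2 = c(3, 4),
  P1 = c(1, 0), P2 = c(3, 0),
  score = c(50, 60)
)

# Each TRUE comparison contributes 1, each FALSE contributes 0
toy$new_score <- toy$score + (toy$T1 == toy$P1) + (toy$T2 == toy$P2)
toy
```

Here the first row matches on both pairs and gains two points, while the second row matches on neither and keeps its score.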
If I understood the problem correctly, each row is an observation, so I compare each T column with the respective P column, then create a score for each comparison, and finally sum the scores for each row.
Libraries
library(tidyverse)
Example Data
df <-
tibble(
T1 = c("a","a","a","b"),
T2 = c("b","a","a","a"),
P1 = c("a","a","a","b"),
P2 = c("a","a","b","b")
)
Code
df %>%
mutate(
S1 = if_else(T1 == P1, 1,-1),
S2 = if_else(T2 == P2, 1,-1)
) %>%
rowwise() %>%
mutate(total_score = sum(c_across(starts_with("S"))))
Output
# A tibble: 4 x 7
# Rowwise:
T1 T2 P1 P2 S1 S2 total_score
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 a b a a 1 -1 0
2 a a a a 1 1 2
3 a a a b 1 -1 0
4 b a b b 1 -1 0
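For completeness, the row-by-row logic from the question (including the Success/Failure column, which the answers above do not use) can also be written as a plain loop with nested if statements. This is only a sketch under assumptions: the data, the column names T1..T3, P1..P3 and `condition` are made up, and the columns are stored as character. One possible cause of the comparison problem in the question, assuming the letters were read in as factors (for example with stringsAsFactors = TRUE), is that factors with different level sets cannot be compared with ==; character columns avoid that.

```r
# Hypothetical data; names and values are assumptions, not the asker's file
data <- data.frame(
  T1 = c("a", "b"), T2 = c("b", "b"), T3 = c("c", "a"),
  P1 = c("a", "a"), P2 = c("b", "a"), P3 = c("a", "a"),
  condition = c("Success", "Failure"),
  stringsAsFactors = FALSE  # keep strings as character, not factor
)

score <- 0
for (i in seq_len(nrow(data))) {
  for (k in 1:3) {
    # Compare the k-th T column with the k-th P column in row i
    same <- data[i, paste0("T", k)] == data[i, paste0("P", k)]
    if (data$condition[i] == "Success") {
      if (same) score <- score + 1
    } else if (data$condition[i] == "Failure") {
      if (!same) score <- score + 1
    }
  }
}
score
```

With this toy data, the Success row has two matching pairs and the Failure row has two non-matching pairs, so the total is 4.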
I understand there have been similar posts, but I'm a little confused on how I can use pivot_longer to transform my own data from wide to long format. The code below creates a mock dataset that's structured similar to my real data.
library(tidyverse)
## Dummy data.
# ID Variables.
part <- rep(rep(paste0("P", c(1:2)), each = 20, times = 2))
type <- rep(c("pre", "post"), each = 10, times = 4)
sp <- rep(c("slow", "mod"), each = 40)
# Values
var1_site1_L <- rep(c(1, NA), each = 5, times = 8)
var1_site1_R <- rep(c(1, NA), each = 5, times = 8)
var1_site1_ALL <- rep(1, times = 80)
var1_site1_ALL_M <- rep(c(1, rep(NA, times = 9)), times = 8)
var2_site2_L <- rep(c(1, NA), each = 5, times = 8)
var2_site2_R <- rep(c(1, NA), each = 5, times = 8)
var2_site2_ALL <- rep(1, times = 80)
var2_site2_ALL_M <- rep(c(1, rep(NA, times = 9)), times = 8)
dat <- data.frame(part, type, sp, var1_site1_L, var1_site1_R, var1_site1_ALL,
var1_site1_ALL_M, var2_site2_L, var2_site2_R, var2_site2_ALL,
var2_site2_ALL_M)
I want to be able to keep the variables part, type and sp as ID variables, but add the unique column name separators as additional ID variables with the specific value in the final column. For example, I'd like the result to be something similar to (note this is only a very basic example and, of course, there will be many more observations, including those NA values in the value column):
par type sp var site side misc value
p1 pre slow var1 site1 L NA 1
p1 pre slow var1 site1 R NA 1
p1 pre slow var1 site1 ALL NA 1
p1 pre slow var1 site1 ALL M 1
I know this is a pretty unique data structure. I'm particularly stuck on how to deal with the fourth column name separator (M) in some instances (cases where there is only a single value per ID variables).
I got up to the below code which I know needs a bit of work if I'm to achieve the result I'm after.
long <- dat %>%
pivot_longer(cols = c(1:3),
names_to = c("var", "site", "side", "misc"),
names_sep = "_")
Any help will be greatly appreciated!
I don't think you can get there with pivot_longer, but try this.
library(stringr)
results <- data.frame()
for (x in 4:length(dat)){
names <- names(dat[,c(1:3,x)])
res <- dat %>%
mutate(id = 1:nrow(dat)) %>%
select(id, names) %>%
mutate(var = str_extract(names[4],"var\\d"),
site = str_extract(names[4],"site\\d"),
side = str_extract(names[4],"L|R|ALL"),
misc = str_extract(names[4],"[M]"),
misc = ifelse(is.na(misc), "NA", misc)) %>%
rename("value" = 5) %>%
select(id, part, type, sp, var, site, side, misc, value)
results <- rbind(results, res)
}
head(results %>% arrange(id) %>% select(-id))
part type sp var site side misc value
1 P1 pre slow var1 site1 L NA 1
2 P1 pre slow var1 site1 R NA 1
3 P1 pre slow var1 site1 ALL NA 1
4 P1 pre slow var1 site1 ALL M 1
5 P1 pre slow var2 site2 L NA 1
6 P1 pre slow var2 site2 R NA 1
dat %>%
pivot_longer(starts_with('var')) %>%
separate(name, c('var', 'site', 'side', 'misc'), fill = 'right')
# A tibble: 640 x 8
part type sp var site side misc value
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 P1 pre slow var1 site1 L NA 1
2 P1 pre slow var1 site1 R NA 1
3 P1 pre slow var1 site1 ALL NA 1
4 P1 pre slow var1 site1 ALL M 1
5 P1 pre slow var2 site2 L NA 1
6 P1 pre slow var2 site2 R NA 1
7 P1 pre slow var2 site2 ALL NA 1
8 P1 pre slow var2 site2 ALL M 1
9 P1 pre slow var1 site1 L NA 1
10 P1 pre slow var1 site1 R NA 1
# ... with 630 more rows
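The fill = 'right' argument above is what handles the names with only three pieces: the missing fourth piece becomes NA. A base R sketch of that padding idea, using two of the column names from the mock data (`nms`, `pieces` and `padded` are hypothetical names; this illustrates the idea and is not how separate() is implemented):

```r
# Two column names from the mock data: one with three pieces, one with four
nms <- c("var1_site1_L", "var1_site1_ALL_M")

pieces <- strsplit(nms, "_", fixed = TRUE)

# Pad each name to four pieces, adding NA on the right where one is missing
padded <- t(vapply(
  pieces,
  function(p) c(p, rep(NA_character_, 4 - length(p))),
  character(4)
))
colnames(padded) <- c("var", "site", "side", "misc")
padded
```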
I experimented with the results I produced in the earlier solution, using pivot_wider and then pivot_longer, and found how to make it work with pivot_longer alone. Your original approach was very close.
dat %>%
pivot_longer(
cols = !c(part, type, sp),
names_to = c("var", "site", "side", "misc"),
names_sep = "_",
values_to = "value"
)
part type sp var site side misc value
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 P1 pre slow var1 site1 L NA 1
2 P1 pre slow var1 site1 R NA 1
3 P1 pre slow var1 site1 ALL NA 1
4 P1 pre slow var1 site1 ALL M 1
5 P1 pre slow var2 site2 L NA 1
6 P1 pre slow var2 site2 R NA 1
7 P1 pre slow var2 site2 ALL NA 1
8 P1 pre slow var2 site2 ALL M 1
9 P1 pre slow var1 site1 L NA 1
10 P1 pre slow var1 site1 R NA 1
I have two data frames and I need to join information from them.
Here is the first df, where I have different points (1, 2, 3, ...):
eleno elety resno
1 N 1
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
The second one indicates distances between points, "eleno" represents the first point and "ele2" the second one:
eleno ele2 values
<chr> <chr> <dbl>
1 2 1.46
1 3 2.46
1 4 2.86
1 5 2.46
1 6 3.83
1 7 4.47
I'd like to have a new column in the 1st df with info from df 2. For example, for point 1 I'd like to have -2 (second point):1.46 (distance), -3:2.46, -4:2.86 and so on, preferably in one column.
Something like this
eleno elety resno dist
1 N 1 -2:1.46, -3:2.46, -4:2.86 ...
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
Thank you!
If I understand your preference for one column, then a possibility without dplyr is as follows. First, we create the new column by concatenating the ele2 and values columns from df2 using the paste() function, with a colon as the separator:
new_column <- paste(-df2$ele2, df2$values, sep = ":")
Then, we use cbind() to bind it to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
This will give us a new data frame like so:
eleno elety resno ele2_values
1 1 N 1 -2:1.46
2 2 CA 1 -3:2.46
3 3 C 1 -4:2.86
4 4 O 1 -5:2.46
5 5 CB 1 -6:3.83
6 6 CG 1 -7:4.47
Here is the data that I used, based on what you have given:
df1 <- data.frame(
eleno = 1:6,
elety = c("N", "CA", "C", "O", "CB", "CG"),
resno = rep(1, 6)
)
df2 <- data.frame(
eleno = rep(1, 6),
ele2 = 2:7,
values = c(1.46, 2.46, 2.86, 2.46, 3.83, 4.47)
)
If we want to get this column as a single element for each point, we can modify our code in the following manner:
Instantiate new_column as an empty vector:
new_column <- vector()
Then call some variant of *apply() or use a for loop to subset the original data frame by points, while applying our original code and appending our singular character elements back to new_column:
lapply(unique(df2$eleno), FUN = function(x) {
subset <- subset(df2, eleno == x)
new_elem <- paste(-subset$ele2, subset$values, sep = ":", collapse = ", ")
new_column <<- c(new_column, new_elem)
})
Once this operation is complete, we use cbind() as before to bind new_column to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
Our output is as follows,
eleno elety resno ele2_values
1 1 N 1 -2:1.13703411305323, -3:6.22299404814839, -4:6.09274732880294, -5:6.23379441676661, -6:8.60915383556858, -7:6.40310605289415
2 2 CA 1 -2:0.094957563560456, -3:2.32550506014377, -4:6.66083758231252, -5:5.14251141343266, -6:6.93591291783378, -7:5.44974835589528
3 3 C 1 -2:2.82733583590016, -3:9.23433484276757, -4:2.92315840255469, -5:8.37295628152788, -6:2.86223284667358, -7:2.66820780001581
4 4 O 1 -2:1.86722789658234, -3:2.32225910527632, -4:3.16612454829738, -5:3.02693370729685, -6:1.59046002896503, -7:0.399959180504084
5 5 CB 1 -2:2.18799541005865, -3:8.10598552459851, -4:5.25697546778247, -5:9.14658166002482, -6:8.3134504687041, -7:0.45770263299346
6 6 CG 1 -2:4.56091482425109, -3:2.65186671866104, -4:3.04672203026712, -5:5.0730687007308, -6:1.81096208281815, -7:7.59670635452494
Here is my random data that I used for df2 in this case:
set.seed(1234)
df2 <- data.frame(
eleno = rep(1:6, rep(6, 6)),
ele2 = 2:7,
values = runif(length(rep(1:6, rep(6, 6)))) * 10
)
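The cbind() step relies on new_column having one element per row of df1, in the same order. If some points have no distances, or the order differs, a keyed merge may be safer. A possible base R sketch, assuming numeric eleno in both data frames and using small made-up data (`pts`, `dists` and `collapsed` are hypothetical names):

```r
# Made-up stand-ins for df1 and df2; point 3 has no distances on purpose
pts <- data.frame(eleno = 1:3, elety = c("N", "CA", "C"), resno = 1)
dists <- data.frame(
  eleno = c(1, 1, 2),
  ele2 = c(2, 3, 3),
  values = c(1.46, 2.46, 1.52)
)

# Build "-ele2:value" strings, then collapse them per starting point
dists$pair <- paste0("-", dists$ele2, ":", dists$values)
collapsed <- aggregate(pair ~ eleno, data = dists, FUN = paste, collapse = ", ")
names(collapsed)[2] <- "dist"

# Left join on eleno; points without distances get NA in dist
merged <- merge(pts, collapsed, by = "eleno", all.x = TRUE)
merged
```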
I have 5 data frames like the ones below:
df_mon <- data.frame(mon = as.factor(c(6, 7, 8, 9, 10)),
number = c(1.11, 1.02, 0.95, 0.92, 0.72))
df_year <- data.frame(year = as.factor(c(1, 2)),
number = c(1.61, 0.4))
df_cat <- data.frame(cat = c("A", "B", "C"),
number = c(1.11, 1.02, 0.44))
df_bin <- data.frame(bin = as.factor(c(1, 2)),
number = c(1.42, 0.56))
df_cat2 <- data.frame(cat2 = c("A", "B", "C", "D", "AA"),
number = c(0.11, 1.22, 1.34, 0.88, 0.75))
I need to multiply all the numbers in the 'number' columns from each of these data frames with each other. So, look at all the possible combinations of the first columns of the data sets, then take the corresponding numbers and multiply them. The final results data frame should look something like this (the first 3 rows are done):
results_df <- data.frame(combi = c("mon6_year1_catA_bin1_cat2A", "mon6_year1_catA_bin1_cat2B", "mon6_year1_catA_bin1_cat2C"),
final_number = c(1.11*1.61*1.11*1.42*0.11, 1.11*1.61*1.11*1.42*1.22, 1.11*1.61*1.11*1.42*1.34))
We can see that the first column in results_df shows which combination was used to calculate final_number. In the first example, the 'number' value for category 6 in df_mon (1.11) is taken and multiplied with the following:
category 1 (1.61) from df_year
category A (1.11) from df_cat
category 1 (1.42) from df_bin
category A (0.11) from df_cat2
The answer for this combination is 1.11 x 1.61 x 1.11 x 1.42 x 0.11 = 0.3098.
The 2nd row shows the next possible combination and so on.
I'm not sure how to achieve this, so any help will be greatly appreciated!
Maybe you can try expand.grid like below
lst <- list(df_mon, df_year, df_cat, df_bin, df_cat2)
results_df <- data.frame(
combi = do.call(
paste,
c(do.call(
expand.grid,
lapply(lst, function(v) paste0(names(v[1]), v[, 1]))
), sep = "_")
),
final_number = Reduce(
"*",
do.call(
expand.grid,
lapply(lst, `[[`, 2)
)
)
)
which gives
> head(results_df)
combi final_number
1 mon6_year1_catA_bin1_cat2A 0.30985097
2 mon7_year1_catA_bin1_cat2A 0.28472792
3 mon8_year1_catA_bin1_cat2A 0.26518777
4 mon9_year1_catA_bin1_cat2A 0.25681342
5 mon10_year1_catA_bin1_cat2A 0.20098441
6 mon6_year2_catA_bin1_cat2A 0.07698161
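Note that expand.grid() varies its first argument fastest, which is why the rows above run through the months first, rather than through cat2 as in the question's sketch. A tiny illustration with made-up numbers (`g` and `products` are hypothetical names):

```r
# Two made-up factors: the first column cycles fastest
g <- expand.grid(a = c(2, 3), b = c(10, 100))
g

# Reduce() multiplies the columns of the grid element by element
products <- Reduce(`*`, g)
products
```

If the question's ordering matters, reversing the list before calling expand.grid() (and reversing the pasted name parts back) is one way to get there; sorting the result afterwards is another.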
Here is an approach using dplyr and tidyr.
df_all <- df_mon %>%
full_join(df_year, by = character()) %>% # by = character() ensures cross join
full_join(df_cat, by = character()) %>%
full_join(df_bin, by = character()) %>%
full_join(df_cat2, by = character()) %>%
pivot_longer(cols = c(-mon, -year, -cat, -bin, -cat2)) %>%
group_by(mon, year, cat, bin, cat2) %>%
summarize(final_number = prod(value), .groups = "keep")
# A tibble: 300 x 6
# Groups: mon, year, cat, bin, cat2 [300]
mon year cat bin cat2 final_number
<fct> <fct> <chr> <fct> <chr> <dbl>
1 6 1 A 1 A 0.310
2 6 1 A 1 AA 2.11
3 6 1 A 1 B 3.44
4 6 1 A 1 C 3.77
5 6 1 A 1 D 2.48
6 6 1 A 2 A 0.122
7 6 1 A 2 AA 0.833
8 6 1 A 2 B 1.36
9 6 1 A 2 C 1.49
10 6 1 A 2 D 0.978
# ... with 290 more rows
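This keeps the variables from the other data frames intact as columns for further analysis, but you could create your combi column with a little paste(). In dplyr 1.1.0 and later, cross_join() is the documented replacement for full_join(by = character()).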
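As a sketch of that last step, the combi label can be pasted together from the key columns. The one-row data frame below is a made-up stand-in for a row of df_all (`one_row` and `combi` are hypothetical names):

```r
# Stand-in for a single row of the joined result
one_row <- data.frame(mon = "6", year = "1", cat = "A", bin = "1", cat2 = "A")

# Prefix each value with its column name and join the parts with "_"
combi <- with(
  one_row,
  paste0("mon", mon, "_year", year, "_cat", cat, "_bin", bin, "_cat2", cat2)
)
combi
```

The same paste0() call should work inside mutate() on the full result, since it is vectorised over rows.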
I want to remove the rows in which the same word appears two or more times consecutively, like a sequence. This is to do a sequential pattern mining analysis.
I already tried the distinct() and duplicated() functions, but they only remove rows that are duplicated as a whole.
r_seq_5 <- r_seq_5[!duplicated(r_seq_5),] # remove duplicates
# Su Score result ROI next_roi third_roi four_roi five_roi
# 1 1 90 high Elsewhere Elsewhere Teacher Teacher Teacher
# 2 1 90 high Elsewhere Teacher Teacher Teacher Teacher
# 3 1 90 high Teacher Pen Teacher Elsewhere Smartboard
This is the table. It doesn't matter if Teacher appears two or three times in a sequence, as long as the occurrences are not consecutive.
The desired result is:
# 1 1 90 high Teacher Pen Teacher Elsewhere Smartboard
To do this, I found it convenient to turn the factors into numbers first, because comparing adjacent columns for matches is less arduous that way.
For this I used a for loop and the qdap package: for each value, I replaced the matches with NA and then filled the NAs with the corresponding number.
library(dplyr)
library(qdap)
df <- data.frame(Su = rep(1,3),
Score = rep(90,3),
ROI = c("A", "A", "B"),
NETX_ROI = c("A", "B", "C"),
third_roi = rep("B", 3),
four_roi = c("B", "B", "A"),
five_roi = c("B", "B", "D"))
df
> df
Su Score ROI NETX_ROI third_roi four_roi five_roi
1 1 90 A A B B B
2 1 90 A B B B B
3 1 90 B C B A D
df2 <- df
roi <- c("A", "B", "C", "D")
# A = Elsewhere
# B = Teacher
# C = Pen
# D = Smartboard
n <- seq(1, length.out = length(roi))
for (i in 1:length(n)) {
df2[df2 == roi[i]] <- NA
df2 <- qdap::NAer(df2, i)
}
> df2
Su Score ROI NETX_ROI third_roi four_roi five_roi
1 1 90 1 1 2 2 2
2 1 90 1 2 2 2 2
3 1 90 2 3 2 1 4
df2 <- df2 %>%
dplyr::select(-c(Su, Score)) %>%
as.matrix()
nn <- ncol(df2)
x <- matrix(nrow = nrow(df2), ncol = ncol(df2)-1)
for (i in 1:(nn-1)) {
xx <- ifelse(df2[,i] == df2[,i+1], NA, 0)
x[,i] <- as.matrix(xx)
}
> x
[,1] [,2] [,3] [,4]
[1,] NA 0 NA NA
[2,] 0 NA NA NA
[3,] 0 0 0 0
Finally, I just removed the rows with NA.
dfx <- x %>%
as.data.frame()
df_test <- df %>%
dplyr::bind_cols(dfx) %>%
na.omit() %>%
dplyr::select(1:ncol(df))
df_test
> df_test
Su Score ROI NETX_ROI third_roi four_roi five_roi
3 1 90 B C B A D
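The adjacent-column comparison can also be done directly on the character values, without recoding them to numbers first. A possible base R sketch on made-up data (`toy`, `m`, `has_repeat` and `kept` are hypothetical names): compare columns 2..n with columns 1..(n-1) in one step and keep the rows where nothing matches.

```r
# Made-up sequences: rows 1 and 2 contain a consecutive repeat, row 3 does not
toy <- data.frame(
  ROI = c("A", "A", "B"),
  next_roi = c("A", "B", "C"),
  third_roi = c("B", "B", "B"),
  stringsAsFactors = FALSE
)

m <- as.matrix(toy)

# Element-wise comparison of each column with the column to its left
has_repeat <- rowSums(m[, -1, drop = FALSE] == m[, -ncol(m), drop = FALSE]) > 0

kept <- toy[!has_repeat, ]
kept
```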
You can use gather() to reshape your variables into long format, and then build a loop to identify whether each value is the same as the preceding one.
Finally, use spread() to rebuild your initial structure.
library(dplyr)
library(tidyr)
df <- data.frame(
row = 1:4,
Su = 1,
Score = 90,
result = 'high',
ROI = c('A', 'A', 'B', 'A'),
ROI2 = c('A', 'B', 'C', 'B'),
ROI3 = c('B', 'B', 'A', 'C')
) %>%
gather(-(row:result), key = roi, value = value) %>%
arrange(row) %>%
mutate(repeated = 0)
for(i in 2:nrow(df)){
if(df$row[i] == df$row[i-1] & df$value[i] == df$value[i-1])
df$repeated[i] = 1
}
df %>%
group_by(row) %>%
mutate(repeated = sum(repeated)) %>%
filter(repeated == 0) %>%
select(-repeated) %>%
spread(key = roi, value = value)
# row Su Score result ROI ROI2 ROI3
# <int> <dbl> <dbl> <fct> <chr> <chr> <chr>
# 1 3 1 90 high B C A
# 2 4 1 90 high A B C
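The for loop over the long data can also be written as a vectorised comparison of each element with the one before it. A small base R sketch with made-up vectors (`row`, `value`, `repeated` and `rows_with_repeat` are hypothetical stand-ins for the columns used above):

```r
# Long-format stand-in: two original rows with three values each
row <- c(1, 1, 1, 2, 2, 2)
value <- c("A", "A", "B", "B", "C", "A")
n <- length(row)

# TRUE where an element has the same row id and the same value as the previous one;
# the first element has no predecessor, so it is FALSE
repeated <- c(FALSE, row[-1] == row[-n] & value[-1] == value[-n])

# Flag each original row that contains at least one consecutive repeat
rows_with_repeat <- tapply(repeated, row, any)
rows_with_repeat
```

Note that the fourth element equals the third in value ("B") but belongs to a different row, so it is not counted as a repeat.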
For each vertex, I am interested in the number of its adjacent edges based on a condition. In the following example, the condition is having a different gender.
Example:
library(igraph)
library(ggraph)
library(tidyverse)
nodes <- tibble(id = 1:4,
gender = c("M", "F", "F", "M"),
names = c("Bob", "Allie", "Mary", "Johnathon"))
edges <- tibble(from = c(1, 3, 2, 4, 1, 2, 1, 4),
to = c(2, 2, 4, 1, 3, 1, 4, 3))
network <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)
ggraph(network) +
geom_edge_link(arrow = arrow(length = unit(4,
'mm')),
start_cap = circle(4, 'mm'),
end_cap = circle(4, 'mm')) +
geom_node_text(aes(label = names)) +
theme_graph()
Desired result:
id name adjacent_edges
1 Bob 1
2 Allie 1
3 Mary 2
4 Johnathon 1
Here's an approach combining base R with igraph:
nodes %>%
mutate(adjacent_edges = colSums(as.matrix(outer(gender, gender, `!=`) * as_adj(network)) != 0))
# A tibble: 4 x 4
# id gender names adjacent_edges
# <int> <chr> <chr> <dbl>
# 1 1 M Bob 1
# 2 2 F Allie 1
# 3 3 F Mary 2
# 4 4 M Johnathon 1
Here
outer(gender, gender, `!=`)
builds a matrix with TRUE entries where the genders differ, while as_adj(network) is the usual graph adjacency matrix. Their element-wise product therefore has nonzero entries exactly where we want them: for connected nodes with different genders. Summing over such cases, column by column, gives the desired result.
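To see the two matrices at work without igraph, here is a base R sketch that builds the adjacency matrix by hand from the example's edge list (`adj` and `differ` are hypothetical names; the hand-built matrix is an assumption standing in for as_adj(network)):

```r
gender <- c("M", "F", "F", "M")
from <- c(1, 3, 2, 4, 1, 2, 1, 4)
to   <- c(2, 2, 4, 1, 3, 1, 4, 3)

# Hand-built adjacency matrix: adj[i, j] is 1 when there is an edge i -> j
adj <- matrix(0, nrow = 4, ncol = 4)
adj[cbind(from, to)] <- 1

# TRUE where the genders of vertex i and vertex j differ
differ <- outer(gender, gender, `!=`)

# Element-wise product keeps only edges between different genders;
# column sums count such edges arriving at each vertex
adjacent_edges <- colSums(differ * adj != 0)
adjacent_edges
```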
Here's another one, lengthier but also more transparent:
edges %>% full_join(nodes, by = c("from" = "id")) %>%
full_join(nodes, by = c("to" = "id"), suff = c(".from", ".to")) %>%
group_by(to, names.to) %>% summarise(adjacent_edges = sum(gender.to != gender.from)) %>%
rename(id = to, name = names.to)
# A tibble: 4 x 3
# Groups: id [4]
# id name adjacent_edges
# <dbl> <chr> <int>
# 1 1 Bob 1
# 2 2 Allie 1
# 3 3 Mary 2
# 4 4 Johnathon 1
In this case we start with the list of edges and join the list of nodes to it twice: once to get node information about the vertex each edge comes from, and once for the vertex it goes to, in the same row. It then remains to summarise the data by counting, for each vertex, the neighbours with a different gender.
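The same lookup-then-count idea can be sketched in base R by indexing the gender vector with the edge endpoints. This assumes, as in the example, that vertex ids are 1..n so they can be used directly as positions (`differs` and `counts` are hypothetical names):

```r
gender <- c("M", "F", "F", "M")
from <- c(1, 3, 2, 4, 1, 2, 1, 4)
to   <- c(2, 2, 4, 1, 3, 1, 4, 3)

# Look up the gender at both ends of every edge and compare them
differs <- gender[from] != gender[to]

# Count, per target vertex, the incoming edges that cross genders
counts <- tapply(differs, to, sum)
counts
```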