Subset a distance matrix in R by values

I have a very large distance matrix (3678 x 3678), currently encoded as a data frame. Columns are named "1", "2", "3" and so on, and rows likewise. I need to find all values that are < 26 and different from 0, and collect the results in a second data frame with two columns: the first holding the index and the second the value. For example:
value
318-516 22.70601
...
where 318 is the row index and 516 is the column index.

OK, I'm trying to recreate your situation (note: if you can, it's always helpful to include a few lines of your data via dput()).
You should be able to use filter and a few simple tidyverse commands (if you are unsure what they do, run the pipeline step by step, selecting everything up to each %>%, to check the intermediate results):
library(tidyverse)
library(tidylog) # gives you additional output on what each command does

# Create some data that looks similar
data <- matrix(rnorm(25, mean = 26), ncol = 5)
data <- as_tibble(data)
data <- setNames(data, c(1:5))

data %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "column", values_to = "values", names_prefix = "V") %>%
  # depending on what your column names look like, you might need a separate() call first
  filter(values > 0 & values < 26) %>%
  # if you want, you can create an index column as well
  mutate(index = paste0(row, "-", column)) %>%
  # then get rid of the row and column helpers
  select(-row, -column) %>%
  # move index to the front
  relocate(index)
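Since the data is really a distance matrix, a base R alternative may be faster at 3678 x 3678. This is a sketch assuming your data frame converts cleanly with as.matrix(); hits and result are names I made up:
m <- as.matrix(data)
hits <- which(m != 0 & m < 26, arr.ind = TRUE) # row/column positions of all matches
result <- data.frame(
  index = paste0(hits[, "row"], "-", hits[, "col"]),
  value = m[hits] # indexing by a two-column matrix returns the matched values
)
Note that a distance matrix is symmetric, so every pair appears twice; keeping only the rows of hits with hits[, "row"] < hits[, "col"] retains one copy per pair.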

Related

Convert a column of list to dummy in R

I have a column with lists of variables, separated by commas, sometimes with values assigned to the variables by "=". See picture.
I want the variables as columns, holding TRUE/FALSE or 1/0 values, plus an extra column for the value whenever one is set with "=".
I guess it's a similar question to Pandas convert a column of list to dummies, but I need it in R.
Since you haven't provided explicit data, I had to recreate it from your screenshots (please post at least the textual data next time; it makes the task much easier to reproduce).
The chunks of code below are explained with comments; they use tidyverse functions from the packages loaded at the top. The result is what you asked for, with the exception that the eventnumber_value columns are named value_eventnumber, since starting a variable or column name with a number is bad practice.
I don't know what you need the data for, but in my experience the wide format is less useful than the long format in most cases. Especially here, since I expect that one event may happen only once per ID. Thus, dat_pivoted is more convenient to operate on.
library(tibble)
library(tidyr)
library(dplyr)
library(stringr)

dat <- tribble(
  ~post_event_list, ~date_time,
  "239=20.00,200,20149,100,101,102,103,104,105,106,107,108,114,198", "2022-03-01 00:23:50",
  "257,159", "2022-03-01 00:02:51",
  "201,109,110,111,112", "2022-03-01 00:57:23"
)

dat_pivoted <- dat %>%
  mutate(post_event_list = str_split(post_event_list, ",")) %>% # turn the comma-separated strings into character vectors
  unnest_longer(post_event_list) %>% # put each element into its own row
  separate(post_event_list, sep = "=", into = c("var", "val"), fill = "right") %>% # separate variables from values (the 'X=Y' case), filling with NA where there is no value
  mutate(val = as.numeric(val)) # treat the 'val' column as numeric

dat_values <- dat_pivoted %>%
  pivot_wider(id_cols = date_time, names_from = var, names_prefix = "value_", values_from = val) %>% # wide format: one column per event value, present or not
  select(!where(~ all(is.na(.x)))) # keep only the value columns where not every element is NA

dat_indicator <- dat_pivoted %>%
  mutate(val = TRUE) %>% # each row indicates the presence of an event -- set all values to TRUE
  pivot_wider(id_cols = date_time, names_from = var, values_from = val, values_fill = FALSE) # pivot the columns again, replacing the resulting NAs with FALSE

dat_transformed <- left_join(dat_indicator, dat_values)
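As a quick illustration of why the long dat_pivoted can be handier than the wide result (a sketch; count() and the n_events name are my additions, not part of the question):
dat_pivoted %>%
  count(date_time, name = "n_events") # number of events per timestamp in one line
Also, spelling out the join key explicitly, left_join(dat_indicator, dat_values, by = "date_time"), avoids relying on the automatic "Joining, by = ..." detection.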

Transpose my R Dataset for association analysis

I am sort of a newbie with R and data manipulation, and I am trying to transpose the UCI words dataset. The default dataset is currently structured as follows.
The first column is the document number, the second column is the word number (referencing another text file), and the last column is the number of times the word occurs in the document. (For now, we can forget about the third column; I know how to drop it from the dataset.)
What I am trying to do is transpose the dataset so that each document's words are in one row. A simple example would look like this.
I tried using the t() function, but it transposes the entire dataset, which is not what I want. I looked into using the dplyr package to help with the data manipulation, but I am not getting any solid leads. If you have any sources, or a particular direction you could nudge me towards for accomplishing this, that would be helpful.
Thank you!
Here's a solution using the tidyverse package (which includes dplyr). The trick is to first add another column to differentiate entries that share the same value in the first column (document number), and then reshape the data to wide format using pivot_wider.
library(tidyverse)

# Your data
df <- read.csv(text = "num word
1 61
2 76
1 89
3 211
3 296", sep = " ")

df %>%
  # Group by num
  group_by(num) %>%
  # Add a row number to differentiate entries with the same first-column value
  mutate(rownum = row_number()) %>%
  # Change the data to wide format
  pivot_wider(id_cols = num,
              names_from = rownum,
              values_from = word)
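With the toy data above, the reshaped result should look roughly like this (columns `1` and `2` are named after the generated row numbers; NA fills in where a document has fewer words):
#   num   `1`   `2`
#     1    61    89
#     2    76    NA
#     3   211   296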
So I was able to figure out how to accomplish this task. Hopefully it helps other data scientists in the future.
library(dplyr)

data <- read.table("docword.kos.txt", sep = " ")
data <- data %>% select(V1, V2)

trans <- data %>%
  group_by(V1) %>%
  summarise(words = paste(V2, collapse = ","))

trans <- trans %>% select(words)
What I ended up doing is using dplyr to perform the data wrangling and group my dataset by the first column. Then I exported and re-uploaded the dataset after making some slight adjustments in Notepad (removing the " characters from the generated CSV file).
write.csv(trans, "~/trend.csv", row.names = FALSE)
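As an aside, the Notepad round-trip can likely be skipped: write.csv() takes a quote argument, so the quotes can be suppressed at write time (same path as above):
write.csv(trans, "~/trend.csv", row.names = FALSE, quote = FALSE)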

R new column (variable) that rowSums across lists with NULL values

I have a data.frame that looks like this:
UID <- c(rep(1:25, 2), rep(26:50, 2))
Group <- c(rep(5, 25), rep(20, 25), rep(-18, 25), rep(-80, 25))
Value <- sample(100:5000, 100, replace = TRUE)
df <- data.frame(UID, Group, Value)
But I need the values separated into new columns, so I run this:
library(tidyr)
df <- pivot_wider(df, names_from = Group,
                  values_from = Value,
                  values_fill = list(Value = 0))
This introduces NULL values into the dataset. (Sorry, I could not figure out a way to build an example dataset with NULL values.) Note: df is now a tbl_df/tbl/data.frame.
These aren't great variable names so I run this:
colnames(df)[which(names(df) == "20")] <- "pos20"
colnames(df)[which(names(df) == "5")] <- "pos5"
colnames(df)[which(names(df) == "-18")] <- "neg18"
colnames(df)[which(names(df) == "-80")] <- "neg80"
What I want to be able to do is create a new column (variable) that holds rowSums across the other columns. So I run this:
df <- df %>%
  replace(is.na(.), 0) %>%
  mutate(rowTot = rowSums(.[2:5]))
This of course works on the example dataset, but not on the one with NULL values. I have tried converting NULL to NA using df[df == "NULL"] <- NA, but the values do not change. I have tried converting the lists to numeric using as.numeric(as.character(unlist(df[[2]]))), but I get an error telling me I have an unequal number of rows, which I guess is to be expected.
I realize there might be a better process to get my desired end result, so any suggestions on any of this are most appreciated.
EDIT: Here is a link to the actual dataset, which will introduce NULL values after using pivot_wider: https://drive.google.com/file/d/1YGh-Vjmpmpo8_sFAtGedxzfCiTpYnKZ3/view?usp=sharing
It is difficult to answer with confidence without an actual reproducible example where the error occurs, but I am going to take a guess.
I think your pivot_wider step produces list-columns (meaning some cells hold vectors), and that is why you are getting NULL values. Create a unique row number within each Group and then use pivot_wider. Also, rowSums has an na.rm parameter, so you don't need replace().
library(dplyr)
library(tidyr)

df %>%
  group_by(temp) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = temp, values_from = numseeds) %>%
  mutate(rowTot = rowSums(.[3:6], na.rm = TRUE))
Please change the column numbers in rowSums according to your data, if needed.
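To check whether list-columns really are the culprit, here is a quick diagnostic (a sketch; df_wide is a stand-in name for your pivoted data):
sapply(df_wide, is.list) # TRUE flags the list-columns created by pivot_wider
Alternatively, calling pivot_wider() with values_fn = length shows which cells would have held more than one value.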

How can I remove all duplicates so that NONE are left in a data frame?

There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.
I have this data frame with 10 rows and 50 columns, where some of the rows are absolutely identical. If I use unique on it, I get one row per, let's say, "type", but what I actually want is to keep only those rows which appear exactly once. Does anyone know how I can achieve this?
I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.
This will extract the rows which appear only once (assuming your data frame is named df):
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
How it works: The function duplicated marks every row that repeats an earlier row, starting the scan at row one. With the argument fromLast = TRUE, the scan starts at the last row instead, so the later copies are treated as the originals.
Both boolean vectors are combined with | (logical 'or') into a new vector that flags all rows appearing more than once. Negating this with ! produces a boolean vector indicating the rows that appear exactly once.
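A tiny illustration of the two flags on a toy vector (a sketch, not taken from the question):
x <- c("a", "b", "a", "c")
duplicated(x)                   # FALSE FALSE  TRUE FALSE -- marks later copies
duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE FALSE -- marks earlier copies
duplicated(x) | duplicated(x, fromLast = TRUE) # TRUE FALSE TRUE FALSE -- both copies of "a"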
A possibility involving dplyr could be:
df %>%
  group_by_all() %>%
  filter(n() == 1)
Or:
df %>%
  group_by_all() %>%
  filter(!any(row_number() > 1))
Since dplyr 1.0.0, the preferred way is:
data %>%
  group_by(across(everything())) %>%
  filter(n() == 1)
Try it:
library(dplyr)
DF1 <- data.frame(Part = c(1, 2, 3, 4, 5), Age = c(23, 34, 23, 25, 24), B.P = c(87, 76, 75, 75, 78))
DF2 <- data.frame(Part = c(3, 5), Age = c(23, 24), B.P = c(75, 78))
DF3 <- rbind(DF1, DF2)
DF3 <- DF3[!(duplicated(DF3) | duplicated(DF3, fromLast = TRUE)), ]
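Here Part 3 and Part 5 each occur twice across DF1 and DF2, so both copies of each are dropped, and DF3 should end up holding only the unique rows:
DF3
#   Part Age B.P
# 1    1  23  87
# 2    2  34  76
# 4    4  25  75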
