Tidying data and reshaping key-value to wide format in R

This is not an easy problem for me, to be honest. I have searched for quite a long time, but there seems to be no similar question.
Here's what a few rows and columns of my data look like:
V1 V2 V3
1 74c1c25f4b283fa74a5514307b0d0278 1#11:2241 1#10:249
2 08f5b445ec6b29deba62e6fd8b0325a6 20#7:249 20#5:83
3 4b7f6f4e2bf237b6cc58f57142bea5c0 4#16:249 24:913
So, the cells are in a format like "class(#subclass):value". I want to make a table like this:
V1 1#10 1#11 4#16 20#5 20#7 24
1 74c1c25f4b283fa74a5514307b0d0278 249 2241 0 0 0 0
2 08f5b445ec6b29deba62e6fd8b0325a6 0 0 0 83 249 0
3 4b7f6f4e2bf237b6cc58f57142bea5c0 0 0 249 0 0 913
Because I haven't met this kind of data structure before, I am not sure if this is the best way to store it, but so far this is the only table format I could come up with. If you have any suggestions, please leave a comment.
I first parsed it as follows:
V1 V2_1_1 V2_1_2 V2_2_1 V3_1_1 V3_1_2 V3_2_1
1 74c1c25f4b283fa74a5514307b0d0278 1 11 2241 1 10 249
2 08f5b445ec6b29deba62e6fd8b0325a6 20 7 249 20 5 83
3 4b7f6f4e2bf237b6cc58f57142bea5c0 4 16 249 24 NA 913
Now I don't know how to convert it to the table format I want. Is there an R package I can use to do it?
Two links are attached below.
original data: https://www.dropbox.com/s/aqay5dn4r3m3kdp/temp1TrainPoiFile.R?dl=0
parsed data:
https://www.dropbox.com/s/0oj8ic1pd2rew0h/temp3TrainPoiFile.R?dl=0
Thank you very much for your help. Please leave a comment if you have any questions.
Thanks to Walt and Jack for their answers. I used tidyr to solve the problem. Below is how I did it.
Load tidyr (it also provides the %>% pipe)
library(tidyr)
Read file
source("temp1TrainPoiFile.R")
Gather columns into key-value pairs
temp2TrainPoiFile <- temp1TrainPoiFile %>% gather( key=V1, value=data, -V1)
Extract into two columns
temp3TrainPoiFile <- temp2TrainPoiFile %>% extract(col=data, into=c("class","value"), regex="(.*):(.*)")
Add row numbers
row <- 1:nrow(temp3TrainPoiFile)
temp3TrainPoiFile <- cbind(row, temp3TrainPoiFile)
Spread key-value pairs into wide columns
TrainPoiFile <- temp3TrainPoiFile %>% spread(key=class, value=value, fill=0)

This looks like a good use case for the tidyr package. Use gather to reshape the data into key-value form, with V1 as the key and the remaining columns collected into a value column named data. Then use extract to split the data column into class and value columns, and finally spread to turn the class column into new column names filled from the value column. The code would look like:
library(tidyr)
library(dplyr)
class_table <- df %>% mutate(row = 1:nrow(.)) %>%
  gather( key=V1, value=data, -c(V1,row)) %>%
  extract(col=data, into=c("class","value"), regex="(.*):(.*)") %>%
  spread(key=class, value=value, fill=0)
Edited to ensure uniqueness of row identifiers. mutate requires dplyr package.
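Newer tidyr versions offer pivot_longer and pivot_wider as replacements for gather and spread. The following is a minimal sketch of the same reshaping with those functions, not the code used in the answers above. It rebuilds the three sample rows from the question, and it assumes that V1 is unique per row and that no class appears twice for the same V1. The column name source_col is a hypothetical helper introduced only for this example.

```r
library(dplyr)
library(tidyr)

# Sample rows copied from the question
df <- data.frame(
  V1 = c("74c1c25f4b283fa74a5514307b0d0278",
         "08f5b445ec6b29deba62e6fd8b0325a6",
         "4b7f6f4e2bf237b6cc58f57142bea5c0"),
  V2 = c("1#11:2241", "20#7:249", "4#16:249"),
  V3 = c("1#10:249", "20#5:83", "24:913"),
  stringsAsFactors = FALSE
)

class_table <- df %>%
  # Long format: one row per (V1, original column) cell
  pivot_longer(-V1, names_to = "source_col", values_to = "data") %>%
  # Drop empty cells, if any
  filter(!is.na(data), data != "") %>%
  # Split "class(#subclass):value" at the colon
  separate(data, into = c("class", "value"), sep = ":") %>%
  mutate(value = as.numeric(value)) %>%
  # The original column name is no longer needed as an identifier
  select(-source_col) %>%
  # Wide format: one column per class, 0 where a class is absent
  pivot_wider(names_from = class, values_from = value, values_fill = 0)

class_table
```

Dropping source_col before widening matters: if it stayed in the data, each V1 would be split across one row per original column instead of being collapsed into a single row.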

Read in data
data <- source("temp1TrainPoiFile.R")[[1]]
Proper NAs
data[data == ""] <- NA
Reshape it into long format
data <- do.call(rbind, lapply(split(data, data[,"V1"]), function(n) {
  id <- n[,1]
  n <- na.omit(unlist(n[,-1]))
  n <- strsplit(n, ":")
  n <- do.call(rbind, lapply(n, function(m) data.frame(column = m[1], value = m[2])))
  n <- data.frame(id = id, n)
  n
}))
Prepare for loop to insert the values into a newly created matrix
id <- unique(data[,"id"])
column <- unique(data[,"column"])
mat <- matrix(data = NA, nrow = length(id), ncol = length(column))
rownames(mat) <- id
colnames(mat) <- column
Insert the values
for(i in 1:nrow(data)) {
  mat[data[i, "id"], data[i, "column"]] <- data[i,"value"]
}

Related

Deriving cosine values for vector contrasts distributed over rows in a dataframe (rows to individual vectors)

I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe has 15 numeric columns, with each row denoting a unique vector.
each row is a 15-item vector
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2). Vec1 is the vector for Row1 and Vec2 is the vector for the next row (lead). I need this function to loop over rows for very large dataframes and am attempting to avoid a for loop. Essentially I need to compute a cosine value for each row contrasted to the next row stopping at the second to last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but I am getting an 'argument is not a matrix' error.
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works for row 1 vs. row 2, not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)
# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i){
  rnorm(300)
})
# Convert to matrix
mat <- t(simplify2array(gen_list))
# Obtain desired values
vals <- unlist(
  lapply(
    2:nrow(mat), function(i){
      cosine(mat[i-1,], mat[i,])
    }
  )
)
You can ignore the gen_list code as this was to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
It runs quickly: 3.39 seconds on my computer.
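For comparison, the cosine of each row with the next one can also be computed without calling a function per row, using only matrix arithmetic. This is a minimal sketch under the assumption that the data is already a numeric matrix. It does not use {lsa}; it applies the usual cosine definition (dot product divided by the product of the norms) directly. The names a, b and cos_next are made up for this example.

```r
# Example data: 10 rows, 15 columns (the shape described in the question)
set.seed(1)
mat <- matrix(runif(150), nrow = 10, ncol = 15)

n <- nrow(mat)
a <- mat[-n, , drop = FALSE]  # rows 1 .. n-1
b <- mat[-1, , drop = FALSE]  # rows 2 .. n (each row's successor)

# Row-wise dot products divided by the product of the row norms
cos_next <- rowSums(a * b) / sqrt(rowSums(a^2) * rowSums(b^2))

# The last row has no successor, so pad with NA
cos_next <- c(cos_next, NA)
cos_next
```

Because everything is done with whole-matrix operations, there is no explicit loop, and the result has one value per row with NA in the last position.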
My answer is similar to Kat's, but I first packaged the 15 values of each row into a list and then created a new column holding the lead (next-row) list.
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
  rowwise %>%
  summarise(row_v = list(c_across())) %>%
  mutate(nextrow_v = lead(row_v)) %>%
  replace_na(list(nextrow_v=list(rep(NA, 15)))) %>% # replace NA with a list of NAs
  rowwise %>%
  summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
cosr[,1]
<dbl>
1 0.820
2 0.791
3 0.780
4 0.785
5 0.838
6 0.808
7 0.718
8 0.743
9 0.773
10 NA
I'm assuming that you aren't looking for vectorization, as well (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide. The first 15 columns are my vector; the second 15 are the lead.
df2 %>%
  rowwise %>%
  mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
  select(cosr) %>% unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for(i in 1:(nrow(df1) - 1)) {
  v1 <- df1[i, ] %>% unlist()
  v2 <- df1[i + 1, ] %>% unlist()
  message(cosine(v1, v2))
}
invisible(
  lapply(1:(nrow(df1) - 1),
         function(i) {message(cosine(unlist(df1[i, ]),
                                     unlist(df1[i + 1, ])))}))

Function containing ifelse not working in dplyr, works fine outside

I'd like to write a function with two inputs (x and y) to create some mutated variables in a very large dataframe. Specifically, if x=y then return x, and if x!=y then draw 1 sample from a sequence of x to y.
The function works fine when I test it outside of my data frame, but throws an error when I try to use it within mutate. I've tried both ifelse and if_else versions.
library(dplyr)
smx <- function(x,y){ #Function to allow sampling if length>1
if_else(x==y,x,sample(seq(x,y,1),1))}
#ifelse(x==y,x,sample(seq(x,y,1),1))} #Have also tried this with ifelse, doesn't work
smx(0,0) #This works
smx(0,5) #This works
#Create dummy data frame
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df
#This doesn't work
df1 <- df %>% mutate(
VarC = smx(varA,varB)
)
Ideally, my output should include a third column (VarC) in which the first row is equal to 0 (because varA=varB) and the remaining rows are a random sample between a sequence from varA to varB.
I have set up my data frame so that varA is always smaller than varB, but I'm not certain. Appreciate any help on a clean solution to this problem!
The function is not working because it is not vectorized. First, you'll need to vectorize your function in order to make it work inside mutate.
You can do that as follows:
vectorized_fun <- Vectorize(your_fun)
Your code will look like this:
smx_v <- Vectorize(smx)
#This works
df1 <- df %>%
  mutate(VarC = smx_v(varA,varB))
The issue here comes from seq: its from and to arguments must have length 1, but inside a dplyr verb such as mutate the function receives whole columns, so that is not the case here.
Using rowwise() solves the problem:
smx <- function(x,y){
  ifelse(x==y,x,sample(seq(x,y,1),1))
}
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df %>%
  rowwise() %>%
  mutate(VarC = smx(varA, varB))
Output:
# A tibble: 5 x 3
# Rowwise:
varA varB VarC
<dbl> <dbl> <dbl>
1 0 0 0
2 0 1 1
3 0 2 1
4 0 3 2
5 0 4 0
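Both fixes above end up calling the function once per row (Vectorize is a wrapper around mapply). The same idea can be written out explicitly in base R. This is a minimal sketch, assuming varA is never larger than varB; smx_safe is a hypothetical name. It indexes the candidate values with sample.int rather than passing them to sample, because sample(n, 1) with a single number n >= 1 draws from 1:n instead of returning n.

```r
# Draw one value from the integer sequence x, x+1, ..., y.
# Assumes x <= y; seq() raises an error otherwise.
smx_safe <- function(x, y) {
  candidates <- seq(x, y, by = 1)
  # Pick by position so a single candidate is returned unchanged
  candidates[sample.int(length(candidates), 1)]
}

df <- data.frame(varA = rep(0, 5), varB = seq(0, 4, 1))

set.seed(42)
# mapply walks over both columns in parallel, one pair per row
df$VarC <- mapply(smx_safe, df$varA, df$varB)
df
```

The first row always yields 0 because the sequence from 0 to 0 has a single candidate; the other rows vary with the seed.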

Make simple table from data frame using formula

I've got a large dataframe with 72 rows and 72 columns. All are numbers. I just want to make a table with the sum of the first row multiplied by the sum of the first column, and so on. For example, if this is my df:
x0 <- c(0,0,11,0)
x0.1 <- c(0,251,0,0)
x0.2 <- c(0,495,0,0)
x0.4 <- c(0,0,0,6)
df <- data.frame(x0,x0.1,x0.2,x0.4)
I want my table to look something like this
1 0
2 124911
3 5445
4 36
Would you be looking for:
rowSums(df)*colSums(df)
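As a minimal sketch, here is that expression applied to the sample data from the question. It assumes the data frame is square, so that row i and column i pair up. For this sample the second value works out to (251 + 495) * 251 = 187246, not the 124911 shown in the question; the other three values match.

```r
# Sample data from the question
x0   <- c(0, 0, 11, 0)
x0.1 <- c(0, 251, 0, 0)
x0.2 <- c(0, 495, 0, 0)
x0.4 <- c(0, 0, 0, 6)
df <- data.frame(x0, x0.1, x0.2, x0.4)

# Element i is (sum of row i) * (sum of column i)
result <- unname(rowSums(df) * colSums(df))
result
# [1]      0 187246   5445     36
```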
Solution using sapply:
sapply(1:nrow(df), function(i){sum(df[i,]*sum(df[,i]))})
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
df %>%
  unite(x, c(x0, x0.1, x0.2, x0.4), sep = "") %>%
  mutate(x = sub("^0+(?!$)", "", x, perl=TRUE))
x
1 0
2 2514950
3 11000
4 6

R Subsetting text from a comma separated column in a data-frame

I have a data.frame with a column that looks like this:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to split this column into two columns, one containing all the values with F and one containing all the other values, resulting in a df that looks like this:
F other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try the following tidyverse approach. You can separate the rows at each comma, then create a group according to the pattern, and finally reshape to wide format to obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = F)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
  mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
  pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit to split at the commas. Then use grep to find the indexes of the F values, select or deselect them by multiplying the indexes by 1 or -1, and paste the results together.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
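The base R idea above handles a single row. A minimal sketch of how it might extend to several rows follows. The second row of data is hypothetical and added only for illustration, and the sketch assumes a code belongs to the F group when it starts with "F" (the answers above match "F" anywhere in the code).

```r
# First row is from the question; the second row is made up for illustration
d <- data.frame(
  diagnosis = c("F.31.2,A.43.2,R.45.2,F.43.1", "A.10.1,F.20.0"),
  stringsAsFactors = FALSE
)

# One character vector of codes per row
codes <- strsplit(d$diagnosis, ",", fixed = TRUE)

# Keep the codes that start with "F", and the ones that do not
d$F <- vapply(codes, function(x) paste(x[startsWith(x, "F")], collapse = ","),
              character(1))
d$other <- vapply(codes, function(x) paste(x[!startsWith(x, "F")], collapse = ","),
                  character(1))
d[, c("F", "other")]
```

Using a logical filter instead of negative indexes also behaves sensibly when a row has no F codes: the F column is then an empty string and other keeps every code.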

How to transpose the first rows into new columns in R?

I want to transpose the first two rows into two new columns and keep the rest of the data frame. How do I do it in R?
My original data
A <- c("2012","PL",3,2)
B <- c("2012","PL",6,1)
C <- c("2012","PL",7,4)
DF <- data.frame(A,B,C)
My final data after transpose
V1 <- c("2012","2012")
V2 <- c("PL","PL")
A <- c(3,2)
B <- c(6,1)
C <- c(7,4)
DF <- data.frame(V1,V2,A,B,C)
Where V1 and V2 are the names for new columns and they are created automatically.
Thank you for any assistance.
Base R:
cbind(t(DF[1:2, 1, drop=FALSE]), DF[-(1:2),])
# Warning in data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
# 1 2 A B C
# 1 2012 PL 3 6 7
# 2 2012 PL 2 1 4
though I have some concerns about the apparent key property of "2012" and "PL". That is, you start with three instances of each and end with two. Logically it makes sense, though really to me it looks as if you have a matrix of numbers associated with a single "2012","PL", but perhaps that's not how the data is coming to you. (If you can change the format of the data before getting to this point such that you have a matrix and its associated keys, then it might make data munging more direct, declarative, and resistant to bugs.)
Here is an option with slice
library(dplyr)
DF %>%
  select(A) %>%
  slice(1:2) %>%
  t %>%
  as.data.frame %>%
  bind_cols(DF %>%
              slice(-(1:2)))
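In the original data every column is character, because c("2012","PL",3,2) coerces the numbers to strings, so the remaining rows stay as text in both answers above. A minimal base R sketch that also converts them back to numbers follows. It assumes the first two rows hold the same key values in every column, so they are taken from column A only; values and result are names made up for this example.

```r
A <- c("2012", "PL", 3, 2)
B <- c("2012", "PL", 6, 1)
C <- c("2012", "PL", 7, 4)
DF <- data.frame(A, B, C, stringsAsFactors = FALSE)

# Everything below the two key rows, converted from character to numeric
values <- DF[-(1:2), , drop = FALSE]
values[] <- lapply(values, as.numeric)

# The two key values are recycled down the new V1 and V2 columns
result <- data.frame(V1 = DF$A[1], V2 = DF$A[2], values,
                     row.names = NULL, stringsAsFactors = FALSE)
result
```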
