How to transpose the first rows into new columns in R? - r

I want to transpose the first two rows into two new columns, and remain the rest of data frame. How do I do it in R?
My original data
A <- c("2012","PL",3,2)
B <- c("2012","PL",6,1)
C <- c("2012","PL",7,4)
DF <- data.frame(A,B,C)
My final data after transpose
V1 <- c("2012","2012")
V2 <- c("PL","PL")
A <- c(3,2)
B <- c(6,1)
C <- c(7,4)
DF <- data.frame(V1,V2,A,B,C)
Where V1 and V2 are the names for new columns and they are created automatically.
Thank you for any assistance.

Base R:
cbind(t(DF[1:2, 1, drop=FALSE]), DF[-(1:2),])
# Warning in data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
# 1 2 A B C
# 1 2012 PL 3 6 7
# 2 2012 PL 2 1 4
though I have some concerns about the apparent key property of "2012" and "PL". That is, you start with three instances of each and end with two. Logically it makes sense, though really to me it looks as if you have a matrix of numbers associated with a single "2012","PL", but perhaps that's not how the data is coming to you. (If you can change the format of the data before getting to this point such that you have a matrix and its associated keys, then it might make data munging more direct, declarative, and resistant to bugs.)

Here is an option with slice
library(dplyr)
DF %>%
select(A) %>%
slice(1:2) %>%
t %>%
as.data.frame %>%
bind_cols(DF %>%
slice(-(1:2)))

Related

Deriving cosine values for vector contrasts distributed over rows in a dataframe (rows to individual vectors)

I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe is structured with 15 numeric columns with each row denoting a unique vector
each row is a 15-item vector
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2). Vec1 is the vector for Row1 and Vec2 is the vector for the next row (lead). I need this function to loop over rows for very large dataframes and am attempting to avoid a for loop. Essentially I need to compute a cosine value for each row contrasted to the next row stopping at the second to last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but am getting an 'argument is not a matrix' error
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works successfully for row1 vs. row2 not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)
# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i){
rnorm(300)
})
# Convert to matrix
mat <- t(simplify2array(gen_list))
# Obtain desired values
vals <- unlist(
lapply(
2:nrow(mat), function(i){
cosine(mat[i-1,], mat[i,])
}
)
)
You can ignore the gen_list code as this was to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
Runs quickly -- 3.39 seconds on my computer
My answer is similar to Kat's, but I firstly packaged the 15 row values into a list and then created a new column with leading list of lists.
Here is a reproducible data
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
rowwise %>%
summarise(row_v = list(c_across())) %>%
mutate(nextrow_v = lead(row_v)) %>%
replace_na(list(nextrow_v=list(rep(NA, 15)))) %>% # replace NA with a list of NAs
rowwise %>%
summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
cosr[,1]
<dbl>
1 0.820
2 0.791
3 0.780
4 0.785
5 0.838
6 0.808
7 0.718
8 0.743
9 0.773
10 NA
I'm assuming that you aren't looking for vectorization, as well (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide. the first 15 are my vector; the second 15 is the lead.
df2 %>%
rowwise %>%
mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
select(cosr) %>% unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for(i in 1:(nrow(df1) - 1)) {
v1 <- df1[i, ] %>% unlist()
v2 <- df1[i + 1, ] %>% unlist()
message(cosine(v1, v2))
}
invisible(
lapply(1:(nrow(df1) - 1),
function(i) {message(cosine(unlist(df1[i, ]),
unlist(df1[i + 1, ])))}))

Data cleaning in R: grouping by number and then by name

A small sample of my dataset looks something like this:
x <- c(1,2,3,4,1,7,1)
y <- c("A","b","a","F","A",".A.","B")
data <- cbind(x,y)
My goal is to first group data that have the same number together and then followed by the same name together (A,a,.A. are considered as the same name for my case).
In other words, the final output should look something like this:
xnew <- c(1,1,3,7,1,2,4)
ynew <- c("A","A","a",".A.","B","b","F")
datanew <- cbind(xnew,ynew)
Currently, I am only able to group by number in the column labelled x. I am unable to group by name yet. I would appreciate any help given.
Note: I need an automated solution as my raw dataset contains over 10,000 lines for the x and y columns.
Assuming what you have is a dataframe data <- data.frame(x,y) and not a matrix which is being generated with cbind you could combine different values into one using fct_collapse and then arrange the data by this new column (z) and x value.
library(dplyr)
library(forcats)
data %>%
mutate(z = fct_collapse(y,
"A" = c('A', '.A.', 'a'),
"B" = c('B', 'b'))) %>%
arrange(z, x) %>%
select(-z) -> result
result
# x y
#1 1 A
#2 1 A
#3 3 a
#4 7 .A.
#5 1 B
#6 2 b
#7 4 F
Or you can remove all the punctuations from y column, make them into upper or lower case and then arrange.
data %>%
mutate(z = toupper(gsub("[[:punct:]]", "", y))) %>%
arrange(z, x) %>%
select(-z) -> result
result
library(dplyr)
data %>%
as.data.frame() %>%
group_by(x, y) %>%
summarise(records = n()) %>%
arrange(x, y)
According to your question it's just a matter of ordering data.
result <- data[order(data$x, data$y),]
or considering that you wan to collate A a .A.
result <- data[order(data$x, toupper(gsub("[^A-Za-z]","",data$y))),]

In R, how do you compare two columns with a regex, row-by row?

I have a data frame (a tibble, actually) df, with two columns, a and b, and I want to filter out the rows in which a is a substring of b. I've tried
df %>%
dplyr::filter(grepl(a,b))
but I get a warning that seems to indicate that R is actually applying grepl with the first argument being the whole column a.
Is there any way to apply a regular expression involving two different columns to each row in a tibble (or data frame)?
If you're only interested in by-row comparisons, you can use rowwise():
df <- data.frame(A=letters[1:5],
B=paste0(letters[3:7],letters[c(2,2,4,3,5)]),
stringsAsFactors=F)
df %>%
rowwise() %>%
filter(grepl(A,B))
A B
1 b db
2 e ge
---------------------------------------------------------------------------------
If you want to know whether row-entry of A is in all of B:
df %>% rowwise() %>% filter(any(grepl(A,df$B)))
A B
1 b db
2 c ed
3 d fc
4 e ge
Or using base R apply and #Chi-Pak's reproducible example
df <- data.frame(A=letters[1:5],
B=paste0(letters[3:7],letters[c(2,2,4,3,5)]),
stringsAsFactors=F)
matched <- sapply(1:nrow(df), function(i) grepl(df$A[i], df$B[i]))
df[matched, ]
Result
A B
2 b db
5 e ge
You can use stringr::str_detect, which is vectorised over both string and pattern. (Whereas, as you noted, grepl is only vectorised over its string argument.)
Using #Chi Pak's example:
library(dplyr)
library(stringr)
df %>%
filter(str_detect(B, fixed(A)))
# A B
# 1 b db
# 2 e ge

tidying data and reshaping key-value to wide format

This is not a easy problem for me, to be honest. I have searched quite a long time but there seems no similar question.
Here's how a few rows and columns of my data looks like:
V1 V2 V3
1 74c1c25f4b283fa74a5514307b0d0278 1#11:2241 1#10:249
2 08f5b445ec6b29deba62e6fd8b0325a6 20#7:249 20#5:83
3 4b7f6f4e2bf237b6cc58f57142bea5c0 4#16:249 24:913
So, the cells are in a format like "class(#subclass):value". I want to make a table like this:
V1 1#10 1#11 4#16 20#5 20#7 24
1 74c1c25f4b283fa74a5514307b0d0278 249 2241 0 0 0 0
2 08f5b445ec6b29deba62e6fd8b0325a6 0 0 0 83 249 0
3 4b7f6f4e2bf237b6cc58f57142bea5c0 0 0 249 0 0 913
Because I haven't met this kind of data structure before, I am not sure if this is the best way to store it. But so far, this is the only table format I could come up with. If you have any suggestion about it, please leave a comment.
Then, I first parsed it as the following:
V1 V2_1_1 V2_1_2 V2_2_1 V3_1_1 V3_1_2 V3_2_1
1 74c1c25f4b283fa74a5514307b0d0278 1 11 2241 1 10 249
2 08f5b445ec6b29deba62e6fd8b0325a6 20 7 249 20 5 83
3 4b7f6f4e2bf237b6cc58f57142bea5c0 4 16 249 24 NA 913
Now, I don't know how to convert it to the table format I want. Any package in R can I use to do it?
two links are attached below
original data: https://www.dropbox.com/s/aqay5dn4r3m3kdp/temp1TrainPoiFile.R?dl=0
parsed data:
https://www.dropbox.com/s/0oj8ic1pd2rew0h/temp3TrainPoiFile.R?dl=0
Thank you very much for you help. Please leave a comment if there's any question about it.
Thanks for Walt's and Jack's answer. I used tidyr to solve the problem. Below is how I did it.
Read file
source("temp1TrainPoiFile.R")
gather columns to key-value pair
temp2TrainPoiFile <- temp1TrainPoiFile %>% gather( key=V1, value=data, -V1)
extract to two columns
temp3TrainPoiFile <- temp2TrainPoiFile %>% extract(col=data, into=c("class","value"), regex="(.*):(.*)")
adding row numbers
row <- 1:nrow(temp3TrainPoiFile)
temp3TrainPoiFile <- cbind(row, temp3TrainPoiFile)
spread key-value to two columns
TrainPoiFile <- temp3TrainPoiFile %>% spread(key=class, value=value, fill=0)
This looks like a good example of the use of the tidyr package. Use gather to transform into a two column data frame using column V1 as the key and the other columns as the value column named data, extract to split the data column into class and value columns, and then spread to use the class column as new column names and the value columns as values. Code would look like:
library(tidyr)
library(dplyr)
class_table <- df %>% mutate(row = 1:nrow(.)) %>%
gather( key=V1, value=data, -c(V1,row)) %>%
extract(col=data, into=c("class","value"), regex="(.*):(.*)") %>%
spread(key=class, value=value, fill=0)
Edited to ensure uniqueness of row identifiers. mutate requires dplyr package.
Read in data
data <- source("temp1TrainPoiFile.R")[[1]]
Proper NAs
data[data == ""] <- NA
Reshape it into long format
data <- do.call(rbind, lapply(split(data, data[,"V1"]), function(n) {
id <- n[,1]
n <- na.omit(unlist(n[,-1]))
n <- strsplit(n, ":")
n <- do.call(rbind, lapply(n, function(m) data.frame(column = m[1], value = m[2])))
n <- data.frame(id = id, n)
n}))
Prepare for loop to insert the values into a newly created matrix
id <- unique(data[,"id"])
column <- unique(data[,"column"])
mat <- matrix(data = NA, nrow = length(id), ncol = length(column))
rownames(mat) <- id
colnames(mat) <- column
Insert the values
for(i in 1:nrow(data)) {
mat[data[i, "id"], data[i, "column"]] <- data[i,"value"]}

Using magrittr and lapply to divide a column in each df in a list by a list of values

I have a list of dataframes containing different time series of different lengths. I want to summarize the count of a variable and then normalize it by the number of years of data that is contained in that particular dataset.
so with a sample dataframe:
data_list <- list(data.frame(temp_bin = rep(1:4, 2:5), value = runif(14)),
data.frame(temp_bin = rep(1:4, 3:6), value = runif(18)),
data.frame(temp_bin = rep(1:4, 4:7), value = runif(22)))
# this might be ~10 different data sets with ~ 100k observations each
count <- lapply(data_list, function(x) {nrow(x)/5} )
# for real data this would be divided by 8760 for the # of hours in a year.
Here is approximately what I want to do, but the n()/count doesn't work because count is a list.
data_bin <- data_list %>%
lapply(., group_by, temp_bin) %>%
lapply(., summarise, n = n()/count)
I tried doing an lapply or mapply within the definition of n, but that didn't seem to work. also tried doing it in two steps - create get a raw n value and then divide in the next step with mapply, but that didn't work either.
If you put the count step in your data_bin step I think it accomplishes what you want, though I am a little hazy on exactly what you mean but I think this works: (Note that you can remove the . assignment from the first argument of lapply, that's the default behavior of %>%)
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
# We need x so I put summarize in a manual function
lapply(function(x){summarize(x,n = 5*n()/nrow(x))}) # move the 5 to numerator
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 0.7142857
2 2 1.0714286
3 3 1.4285714
4 4 1.7857143
Is this what you wanted? You can double check the summarize is part is doing what you want by just returning the nrow(x) result.
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
lapply(function(x){summarize(x,n = nrow(x))})
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 14
2 2 14
3 3 14
4 4 14
I would try to avoid using lapply on every row of a dplyr statement. You could wrap individual data.frame transformation in a function and then lapply that function to data_list
library(dplyr)
ret_db <- function(df) {
db <- df %>%
group_by(.,temp_bin) %>%
summarise(.,n=n()/(nrow(df)/5))
return(db)
}
data_bin <- lapply(data_list,ret_db)

Resources