R Dataframe to list - r

I have a dataframe similar to the below:
v1 v2 v3 v4 v5
a a1 a2 a3 a4
b b1 b2
c c1 c2 c3
I want to convert this to lists as below.
lista <- list(base="a", alts=c("a1","a2","a3","a4"))
listb <- list(base="b", alts=c("b1","b2"))
listc <- list(base="c", alts=c("c1","c2","c3"))
I have looked at solutions posted on here and tried some suggestions, but nothing works.
Any help will be great! I am still new to R - Cheers

If df contains your data frame, you could try this:
l <- lapply(as.data.frame(t(df), stringsAsFactors = FALSE),
function(x) {
x <- unname(x)
list(base = x[1], alts = x[-1])
})
names(l) <- paste0("list", df[, 1])
list2env(l, envir = .GlobalEnv)
lista
# $base
# [1] "a"
#
# $alts
# [1] "a1" "a2" "a3" "a4"
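One thing to watch for: rows b and c have blank cells, and those would end up inside alts. Here is a rough sketch of one way to drop them, assuming the blanks are stored as empty strings or NA (the df below is my guess at how the data was read in, not the asker's actual object):

```r
# Assumed reconstruction of the example data; blanks stored as "".
df <- data.frame(v1 = c("a", "b", "c"),
                 v2 = c("a1", "b1", "c1"),
                 v3 = c("a2", "b2", "c2"),
                 v4 = c("a3", "", "c3"),
                 v5 = c("a4", "", ""),
                 stringsAsFactors = FALSE)

# One single-row data frame per row of df.
rows <- split(df, seq_len(nrow(df)))

l <- lapply(rows, function(r) {
  x <- unlist(r, use.names = FALSE)          # row as a plain character vector
  alts <- x[-1]
  list(base = x[1],
       alts = alts[!is.na(alts) & alts != ""])  # drop blank / missing cells
})
names(l) <- paste0("list", df$v1)

l$listb
```

Keeping everything in the named list l (instead of calling list2env) is usually easier to work with later, but that is a matter of taste.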

Related

How to access the name of a data frames within a list using lapply

library(data.table)
dt <- data.table(V1 = c("Name1", "Name2", "Name3"),
V2 = c(1,2,3),
V3 = c(1,2,3)
)
For reproducibility I've defined the names of the list elements above, but in my data I do not have a list of the datatable names.
I turn the datatable into a list using split:
List <- split(dt, with(dt, interaction(V1)), drop = TRUE)
List
$Name1
V1 V2 V3
1: Name1 1 1
$Name2
V1 V2 V3
1: Name2 2 2
$Name3
V1 V2 V3
1: Name3 3 3
I'm using lapply to manipulate the elements in the list and as part of this I want to access the names of those datatables. names() gives me the variable names of the datatables. How do I reference the names of the datatables?
Listnames <- lapply(List, function(x) {
names(x)
})
Listnames
$Name1
[1] "V1" "V2" "V3"
$Name2
[1] "V1" "V2" "V3"
$Name3
[1] "V1" "V2" "V3"
Do you want this?
lapply(seq_along(List), FUN = function(x) names(List)[x])
[[1]]
[1] "Name1"
[[2]]
[1] "Name2"
[[3]]
[1] "Name3"
purrr's imap was (among other things) invented for this:
library(purrr)
foo <- List %>% imap( function(x,y) {
cat( "x = ", toString(x), "\n" )
cat( "y = ", toString(y), "\n" )
})
Output:
x = Name1, 1, 1
y = Name1
x = Name2, 2, 2
y = Name2
x = Name3, 3, 3
y = Name3
x is the data, y is the name in each iteration
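If you would rather stay in base R, Map can pass the element and its name side by side. A minimal sketch, using a plain data.frame instead of a data.table so it runs without extra packages (the message text is just for illustration):

```r
df <- data.frame(V1 = c("Name1", "Name2", "Name3"),
                 V2 = 1:3,
                 V3 = 1:3,
                 stringsAsFactors = FALSE)
List <- split(df, df$V1)

# x is the data frame, nm is its name in the list.
res <- Map(function(x, nm) paste(nm, "has", nrow(x), "row(s)"),
           List, names(List))

res$Name1
```

The result keeps the names of List, so res$Name1, res$Name2 and so on are available afterwards.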

Making new dataframes from old dataframes by column number

I'm trying to reorganize my data frames by column.
For example:
x <- data.frame("A" = c(1,1), "B" = c(2,2), "C" = c(3,3))
y <- data.frame("A" = c(2,2), "B" = c(3,3), "C" = c(4,4))
z <- data.frame("A" = c(3,3), "B" = c(4,4), "C" = c(5,5))
Say I have the data frames above.
What I want to do is build new data frames from the columns of those data frames. Simply put, I want to collect all the "A"s, "B"s and "C"s into 3 new data frames.
The data frames below are the results I want:
a <- data.frame("A" = c(1,1), "A" = c(2,2), "A" = c(3,3))
b <- data.frame("B" = c(2,2), "B" = c(3,3), "B" = c(4,4))
c <- data.frame("C" = c(3,3), "C" = c(4,4), "C" = c(5,5))
We can do this with tidyverse
library(tidyverse)
list(x, y, z) %>%
transpose %>%
map(~ do.call(cbind, .x))
Or with base R
lapply(names(x), function(nm) cbind(x[, nm], y[, nm], z[, nm]))
Assuming you have equal number of columns in all the dataframes, one way is to use lapply over list of dataframes and subset them sequentially.
lst1 <- list(x, y, z)
lapply(seq_len(ncol(x)), function(i) cbind.data.frame(lapply(lst1, `[`, i)))
#[[1]]
# A A A
#1 1 2 3
#2 1 2 3
#[[2]]
# B B B
#1 2 3 4
#2 2 3 4
#[[3]]
# C C C
#1 3 4 5
#2 3 4 5
If your dataframes are not already sorted by names you might want to do that first.
lst1 <- lapply(list(x, y, z), function(i) i[order(names(i))])
We can also use purrr using the same logic
library(purrr)
map(seq_len(ncol(x)), ~cbind.data.frame(map(lst1, `[`, .)))
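A variation that matches columns by name instead of by position, so the column order in each data frame does not matter. This is only a sketch; by_col and the list names x, y, z are my own choices, and it assumes every data frame has all of the columns in names(x):

```r
x <- data.frame(A = c(1, 1), B = c(2, 2), C = c(3, 3))
y <- data.frame(A = c(2, 2), B = c(3, 3), C = c(4, 4))
z <- data.frame(A = c(3, 3), B = c(4, 4), C = c(5, 5))

dfs <- list(x = x, y = y, z = z)

# For each column name, pull that column out of every data frame.
# setNames() makes the result a list named "A", "B", "C".
by_col <- lapply(setNames(names(x), names(x)), function(nm) {
  as.data.frame(lapply(dfs, `[[`, nm))
})

by_col$A
```

Each resulting data frame has columns named after the source data frames (x, y, z), which avoids the duplicated column names in the desired output.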

More efficient than nested loop in R

I am trying to break my habit of for loops by using apply but I've gotten stumped on this one. I have a for loop that collapses every two rows into one row for an object, obj.tmp(366 by 34343), but it is slow.
Here's a much shortened example:
df <- data.frame(X1 = letters[1:10], X2 = letters[11:20], stringsAsFactors = FALSE)
Thus:
> df
X1 X2
a k
b l
c m
d n
e o
f p
g q
h r
i s
j t
df2 <- df[0, ] # empty data frame with the same columns, filled row by row below
for(i in 1:(nrow(df)/2)){
df2[i,] <- apply( df[(i*2-1):(i*2),], 2, paste, collapse = "")
}
Output:
> df2
X1 X2
ab kl
cd mn
ef op
gh qr
ij st
Suggestions on a better method?
Based on your sample data, here is one possibility:
# Sample data
df <- data.frame(X1 = letters[1:10], X2 = letters[11:20], stringsAsFactors = FALSE);
do.call(rbind, lapply(split(df, gl(nrow(df) / 2, 2, nrow(df))), function(x) sapply(x, paste0, collapse = "")))
# X1 X2
#1 "ab" "kl"
#2 "cd" "mn"
#3 "ef" "op"
#4 "gh" "qr"
#5 "ij" "st"
Explanation: Split df every two rows and store in list, paste entries by column, and rbind into final object.
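To make the steps easier to follow, here they are one at a time (same sample data; the intermediate names grp, pieces and pasted are just for illustration):

```r
df <- data.frame(X1 = letters[1:10], X2 = letters[11:20],
                 stringsAsFactors = FALSE)

# Step 1: a grouping factor that labels every two consecutive rows.
grp <- gl(nrow(df) / 2, 2, nrow(df))
as.integer(grp)

# Step 2: split the data frame into a list of two-row pieces.
pieces <- split(df, grp)

# Step 3: within each piece, paste each column into a single string.
pasted <- lapply(pieces, function(x) sapply(x, paste0, collapse = ""))

# Step 4: stack the pieces into a character matrix.
res <- do.call(rbind, pasted)
res
```

Note that this assumes an even number of rows; with an odd count, nrow(df) / 2 is not a whole number and gl will not give the grouping you expect.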
If you want to avoid rbinding the list element, you can also do:
t(sapply(split(df, gl(nrow(df) / 2, 2, nrow(df))), function(x) sapply(x, paste0, collapse = "")));
# X1 X2
#1 "ab" "kl"
#2 "cd" "mn"
#3 "ef" "op"
#4 "gh" "qr"
#5 "ij" "st"
We can use the aggregate function:
df1=cbind(df,id=rep(1:(nrow(df)/2),each=2)) # Create a new df with an id that shows the rows to be combined
aggregate(.~id,df1,paste0,collapse="")[-1] # Combine the rows
X1 X2
1 ab kl
2 cd mn
3 ef op
4 gh qr
5 ij st
You can do all this in one line:
aggregate(.~id,cbind(df,id=rep(1:(nrow(df)/2),each=2)),paste0,collapse="")[-1]
You can also try:
matrix(do.call(paste0,data.frame(matrix(unlist(df),,2,T))),,2)
[,1] [,2]
[1,] "ab" "kl"
[2,] "cd" "mn"
[3,] "ef" "op"
[4,] "gh" "qr"
[5,] "ij" "st"
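Since the real object is wide (366 by 34343), it may be worth skipping the split entirely and pasting the odd rows onto the even rows column by column. A sketch under the assumption that the number of rows is even; I have not timed it on data of that size:

```r
df <- data.frame(X1 = letters[1:10], X2 = letters[11:20],
                 stringsAsFactors = FALSE)

# Indices of the first row of each pair: 1, 3, 5, ...
odd <- seq(1, nrow(df), by = 2)

# paste0 is vectorised, so each column is handled in one call.
res <- as.data.frame(
  lapply(df, function(col) paste0(col[odd], col[odd + 1])),
  stringsAsFactors = FALSE
)

res
```

This returns a data frame with the original column names and half as many rows.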
Something like this? If not, can you be more specific and post the code needed to reproduce what you are doing? I hope this solves your problem.
df <- data.frame(X1 = letters[1:10], stringsAsFactors = FALSE)
df2 <- data.frame(X1 = character(), stringsAsFactors = FALSE)
sapply(1:round(nrow(df)/2), FUN = function(x) {
df2[x,] <<- paste(df[(x*2-1):(x*2),], collapse = "")
})
df2

Find an element following a sequence across rows in a data frame

I have a data set with the structure shown below.
# example data set
a <- "a"
b <- "b"
d <- "d"
id1 <- c(a,a,a,a,b,b,d,d,a,a,d)
id2 <- c(b,d,d,d,a,a,a,a,b,b,d)
id3 <- c(b,d,d,a,a,a,a,d,b,d,d)
dat <- rbind(id1,id2,id3)
dat <- data.frame(dat)
I need to find across each row the first sequence with repeated elements "a" and identify the element following the sequence immediately.
# desired results
dat$s3 <- c("b","b","d")
dat
I was able to break the problem in 3 steps and solve the first one but as my programming skills are quite limited, I would appreciate any advice on how to approach steps 2 and 3. If you have an idea that solves the problem in another way that would be extremely helpful as well.
Here is what I have so far:
# Step 1: find the first occurrence of "a" in the first sequence
dat$s1 <- apply(dat, 1, function(x) match(a,x))
# Step 2: find the last occurrence in the first sequence
# Step 3: find the element following the last occurrence in the first sequence
Thanks in advance!
I'd use filter:
fun <- function(x) {
x <- as.character(x)
isa <- (x == "a") #find "a" values
#find sequences with two TRUE values and the last value FALSE
ids <- stats::filter(isa, c(1,1,1), sides = 1) == 2L & !isa
na.omit(x[ids])[1] #subset
}
apply(dat, 1, fun)
#id1 id2 id3
#"b" "b" "d"
Try this (assuming each row contains a run of at least two consecutive "a"s):
library(stringr)
dat$s3 <-apply(dat, 1, function(x) str_match(paste(x, collapse=''),'aa([^a])')[,2])
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 s3
id1 a a a a b b d d a a d b
id2 b d d d a a a a b b d b
id3 b d d a a a a d b d d d
Well, here is one attempt which is a bit messy,
l1 <- lapply(apply(dat, 1, function(i) as.integer(which(i == a))),
function(j) j[cumsum(c(1, diff(j) != 1)) == 1])
ind <- unname(sapply(l1, function(i) tail(i, 1) + 1))
dat$s3 <- diag(as.matrix(dat[ind]))
dat$s3
#[1] "b" "b" "d"
or wrap it in a function,
fun1 <- function(df){
l1 <- lapply(apply(df, 1, function(i) as.integer(which(i == a))),
function(j) j[cumsum(c(1, diff(j) != 1)) == 1])
ind <- unname(sapply(l1, function(i) tail(i, 1) + 1))
return(diag(as.matrix(df[ind])))
}
fun1(dat)
#[1] "b" "b" "d"
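Another option is rle, which maps fairly directly onto the three steps in the question. A sketch only: the helper name next_after_first_run is made up, and it assumes a "sequence" means at least two consecutive "a"s and returns NA when there is no such run or nothing follows it:

```r
a <- "a"; b <- "b"; d <- "d"
id1 <- c(a, a, a, a, b, b, d, d, a, a, d)
id2 <- c(b, d, d, d, a, a, a, a, b, b, d)
id3 <- c(b, d, d, a, a, a, a, d, b, d, d)
dat <- data.frame(rbind(id1, id2, id3))

next_after_first_run <- function(x, target = "a") {
  x <- as.character(x)
  r <- rle(x)
  # First run of the target value with length >= 2.
  k <- which(r$values == target & r$lengths >= 2)[1]
  if (is.na(k)) return(NA_character_)
  # Position of the last element of that run.
  end <- sum(r$lengths[seq_len(k)])
  if (end == length(x)) NA_character_ else x[end + 1]
}

s3 <- apply(dat, 1, next_after_first_run)
s3
```

Compute s3 before adding any helper columns to dat, otherwise those columns become part of each row.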

Replacing certain special characters like # or - with their actual words "at" and "dash" using R

I have been working on some data cleaning program in R and I have run into a problem. I am trying to replace the special characters like "#" with their character counterparts "at".
I have tried sub, gsub and setNames and even replace.
All of these produce the same result: it just gives me a ton of NAs in my data.
I have a sample of what my data looks like just for reference.
Just imagine that I cannot see where all of the # signs are, I want to search the entire data set and replace all of them. My actual dataset has 50 columns so going by column won't work.
################## EDIT ##################
aa <- read.csv("C:/Users/Zander Kent/Documents/Data Cleaning/sample dataset.csv", header = T, na.strings=c("", " ","NA"))
aaa <- data.frame(aa)
abc <- as.data.frame(apply(aaa, 2, function(x) gsub(" # ", "at", x)))
write.csv(abc, file="C:/Users/Zander Kent/Documents/Data Cleaning/clean_2.csv")
I tried one of the answers and it worked on a very small 10x10 data set, but when I tried it on my entire dataset it didn't do anything: all of the special characters were still there. There were no error messages and the code ran without any problems.
Lacking a reproducible example...
Let's create a data frame with the undesired symbol:
a <- data.frame(x = c("1", "a", "3#"), y = c("5#", "2", "b"))
Now we can use gsub:
as.data.frame(apply(a, 2, function(x) gsub("#", "at", x)))
and obtain:
## x y
## 1 1 5at
## 2 a 2
## 3 3at b
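If you want to keep the result as a data frame with character columns (apply goes through a matrix first), something along these lines may be a bit safer. It is a sketch on the same toy data; fixed = TRUE makes gsub treat the pattern as a literal string instead of a regular expression:

```r
a <- data.frame(x = c("1", "a", "3#"),
                y = c("5#", "2", "b"),
                stringsAsFactors = FALSE)

# a[] <- ... replaces the columns in place and keeps the data frame structure.
a[] <- lapply(a, function(col) gsub("#", "at", col, fixed = TRUE))

a
```

The same pattern works for any number of columns, so a 50-column dataset does not need to be handled column by column.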
####### EDIT #####
If you want to replace "-" with "dash", then there is a nice function in the qdap package. Let's re-create the data frame with the two bad guys:
a <- data.frame(x = c("1-", "a", "3#"), y = c("5#", "2", "b-"))
Then we do:
require(qdap)
as.data.frame(apply(a, 2, function(x) multigsub(c("#", "-"),
c("at", "dash"),
x)))
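If installing qdap is not an option, a small loop over gsub does roughly the same job in base R. This is a sketch; replace_all is a hypothetical helper, not a function from any package, and it applies the replacements in the order given:

```r
# Hypothetical helper: apply several literal replacements one after another.
replace_all <- function(x, patterns, replacements) {
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x, fixed = TRUE)
  }
  x
}

a <- data.frame(x = c("1-", "a", "3#"),
                y = c("5#", "2", "b-"),
                stringsAsFactors = FALSE)

a[] <- lapply(a, replace_all,
              patterns = c("#", "-"),
              replacements = c("at", "dash"))

a
```

Because the replacements run sequentially, a later pattern can also match text inserted by an earlier one, so order the pairs with that in mind.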
####### EDIT 2 #######
This works, and is pretty big:
x <- sample(LETTERS, 1e6, TRUE)
y <- sample(c("", "", "", "#", "-"), 1e6, TRUE)
a <- data.frame(x, y)
b <- apply(a, 1, function(x) paste(x, collapse = ""))
df <- as.data.frame(matrix(b, ncol=50))
df[1:4, 1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 T V- H L T K# M T# M- I
2 G W F# K# W# T# R X- G G-
3 R E# V O# R D L L C- B
4 T G# J U X H# Q Q T Z#
df2 <- apply(df, 2, function(x) multigsub(c("#", "-"), c("at", "dash"), x))
grep("-", df2)
integer(0)
grep("#", df2)
integer(0)
df2[1:4, 1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
[1,] "T" "Vdash" "H" "L" "T" "Kat" "M" "Tat" "Mdash" "I"
[2,] "G" "W" "Fat" "Kat" "Wat" "Tat" "R" "Xdash" "G" "Gdash"
[3,] "R" "Eat" "V" "Oat" "R" "D" "L" "L" "Cdash" "B"
[4,] "T" "Gat" "J" "U" "X" "Hat" "Q" "Q" "T" "Zat"
