Convert single column dataframe to dataframe with multiple rows and named columns - r

dfOrig <- data.frame(rbind("1",
"C",
"531404",
"3",
"B",
"477644"))
names(dfOrig) <- "Value" # base R; setnames(dfOrig, "Value") would need library(data.table)
I have a single column vector, which actually comprises two observations of three variables. How do I convert it to a data.frame with the following structure:
ID Code Tag
"1" "C" "531404"
"3" "B" "477644"
Obviously, this is just a toy example to illustrate a real-world problem with many more observations and variables.

Here's another approach - it does rely on the dfOrig column being ordered 1,2,3,1,2,3 etc.
x <- c("ID", "Code", "Tag") # new column names
n <- length(x) # number of columns
res <- data.frame(lapply(split(as.character(dfOrig$Value), rep(x, nrow(dfOrig)/n)),
type.convert))
The resulting data is:
> str(res)
#'data.frame': 2 obs. of 3 variables:
# $ Code: Factor w/ 2 levels "B","C": 2 1
# $ ID : int 1 3
# $ Tag : int 531404 477644
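As you can see, the column classes have been converted. If you want the Code column to be character instead of factor, pass as.is = TRUE to type.convert, e.g. lapply(..., type.convert, as.is = TRUE). Specifying stringsAsFactors = FALSE in the data.frame call does not help here, because type.convert has already returned a factor by that point.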
And it looks like this:
> res
# Code ID Tag
#1 C 1 531404
#2 B 3 477644
Note: You have to get the column name order in x in line with the order of the entries in dfOrig$Value.
If you want to get the column order of res as specified in x, you can use the following:
res <- res[, match(x, names(res))]

Maybe convert to a matrix with ncol:
# set number of columns
myNcol <- 3
# convert to matrix, then dataframe
res <- data.frame(matrix(dfOrig$Value, ncol = myNcol, byrow = TRUE),
stringsAsFactors = FALSE)
# convert the type and add column names
res <- as.data.frame(lapply(res, type.convert),
col.names = c("resID", "Code", "Tag"))
res
# resID Code Tag
# 1 1 C 531404
# 2 3 B 477644
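The matrix approach can be condensed so that the column names are attached at reshape time and strings stay as character. This is only a sketch under assumptions: the values are stored row by row, the length of the column is a multiple of the number of variables, and the name col_names is introduced here for illustration.

```r
# Toy input: one column holding two observations of three variables
dfOrig <- data.frame(Value = c("1", "C", "531404", "3", "B", "477644"),
                     stringsAsFactors = FALSE)

col_names <- c("ID", "Code", "Tag")  # assumed order of the repeating entries

# Guard: the column must split evenly into complete rows
stopifnot(nrow(dfOrig) %% length(col_names) == 0)

# Fill a matrix row by row and name its columns in the same step
m <- matrix(dfOrig$Value, ncol = length(col_names), byrow = TRUE,
            dimnames = list(NULL, col_names))

res <- as.data.frame(m, stringsAsFactors = FALSE)

# Convert each column to its natural type; as.is = TRUE keeps strings as character
res[] <- lapply(res, type.convert, as.is = TRUE)
str(res)
```

Because the names are set through dimnames, no reordering step is needed afterwards.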

You can create a sequence of numbers
x <- seq_len(nrow(dfOrig)) %% 3 #you can change this 3 to number of columns you need
data.frame(ID = dfOrig$Value[x == 1],
Code = dfOrig$Value[x == 2],
Tag = dfOrig$Value[x == 0])
#ID Code Tag
#1 1 C 531404
#2 3 B 477644
Another approach would be splitting the dataframe according to the sequence generated above and then binding the columns using do.call
x <- seq_len(nrow(dfOrig)) %% 3
res <- do.call("cbind", split(dfOrig,x))
You can definitely change the column names
colnames(res) <- c("Tag", "Id", "Code")
# Tag Id Code
#3 531404 1 C
#6 477644 3 B

Related

Assign value in dataframe from list by list's element name = dataframe row number

I have a named list, such as the following:
> myNamedList
(...)
$`1870`
[1] 84.24639
$`1871`
[1] 84.59707
(...)
I would like to assign these values in a dataframe's column where the list element's name corresponds to the dataframe's row number. For now I am proceeding like this:
for (element in names(myNamedList)) {
targetDataFrame[as.numeric(element),][[columnName]] = myNamedList[[element]]
}
This is quite slow if the list is somewhat large, and also not very R-esque. I believe I could do something with apply, but am not sure where to look. Appreciate your help.
Add a row number to original data, then stack the list, then merge. See example:
# example
#data
set.seed(1); d <- data.frame(x = sample(LETTERS, 5))
#named list
x <- list("2" = 11, "4" = 22)
#add a row number
d$rowID = seq(nrow(d))
# stack the list, and merge
merge(d, stack(x), by.x = "rowID", by.y = "ind", all.x = TRUE)
# rowID x values
# 1 1 Y NA
# 2 2 D 11
# 3 3 G NA
# 4 4 A 22
# 5 5 B NA
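If the merge is not needed, the loop from the question can probably be replaced by one vectorised assignment. The sketch below assumes that every list name is a valid row number and that every element has length one; the column name values and the helper idx are hypothetical.

```r
# Example data frame and a list named by row number
d <- data.frame(x = c("Y", "D", "G", "A", "B"), stringsAsFactors = FALSE)
myNamedList <- list("2" = 11, "4" = 22)

# Row numbers taken from the list names
idx <- as.integer(names(myNamedList))

# Start with NA everywhere, then fill all target rows in one step
d$values <- NA_real_
d$values[idx] <- unlist(myNamedList, use.names = FALSE)
d
```

This avoids the row-by-row data frame modification that makes the for loop slow.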

How to look up unique values in another column and assign IDs accordingly

I have a toy example to explain what I am trying to work on :
aski = data.frame(x=c("a","b","c","a","d","d"),y=c("b","a","d","a","b","c"))
I managed to assign unique IDs to column y, and the output now looks like this:
aski2 = data.frame(x=c("a","b","c","a","d","d"),y=c("1","2","3","2","1","4"))
As you can see, "b" is present in both columns x and y and was assigned id = 1 in column y, "a" was assigned id = 2 in column y, and so on.
These values are also present in column x. For example, the first element of column x is "a"; "a" also occurs in column y, where it was assigned id = 2, so I want to assign id = 2 to "a" in column x as well.
What I am trying to do next is look up each value of column x and, if it occurs in column y, assign it the same ID.
The final data frame should look like this:
aski3 = data.frame(x=c("2","1","4","2","3","3"),y=c("1","2","3","2","1","4"))
Without the need to create aski2 as an intermediate, a possible solution is to use match with lapply to get the numeric representations of the letters:
# create a vector of the unique values in the order
# in which you want them assigned to '1' till '4'
v <- unique(aski$y)
# convert both columns to integer values with 'match' and 'lapply'
aski[] <- lapply(aski, match, v)
which gives:
> aski
x y
1 2 1
2 1 2
3 4 3
4 2 2
5 3 1
6 3 4
If you want the number as characters, you can additionally do:
aski[] <- lapply(aski, as.character)
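To see what match does in this solution: it returns, for each element of its first argument, the position of the first matching element in the second, and NA when there is no match. A small illustration using the values of the toy example (the names x, ids and missing_id are introduced here only for the sketch):

```r
# Lookup vector: the position of each letter is its ID
v <- c("b", "a", "d", "c")

# Values of column x from the toy example
x <- c("a", "b", "c", "a", "d", "d")

ids <- match(x, v)            # position of each x value within v
missing_id <- match("z", v)   # a value absent from v yields NA

ids
missing_id
```

Any value in column x that never occurs in column y would therefore end up as NA rather than receiving a new ID.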
First, convert both columns to character vectors.
Then, collect all unique values from the two columns to use as levels of a factor.
Convert both columns to factors, then numeric.
aski = data.frame(x=c("a","b","c","a","d","d"),y=c("b","a","d","a","b","c"))
aski$x <- as.character(aski$x)
aski$y <- as.character(aski$y)
lev <- unique(c(aski$y, aski$x))
aski$x <- factor(aski$x, levels=lev)
aski$y <- factor(aski$y, levels=lev)
aski$x <- as.numeric(aski$x)
aski$y <- as.numeric(aski$y)
aski
A solution using dplyr. First create a vector, vec, that maps each index to a letter, using unique(aski$y). After this step, you can use Jaap's lapply solution, or you can use mutate_all from dplyr as follows.
# Create the vector showing the relationship of index and letter
vec <- unique(aski$y)
# View vec
vec
[1] "b" "a" "d" "c"
library(dplyr)
# Modify all columns
aski2 <- aski %>% mutate_all(~ match(., vec)) # formula shorthand; funs() is deprecated in recent dplyr versions
# View the results
aski2
x y
1 2 1
2 1 2
3 4 3
4 2 2
5 3 1
6 3 4
Data
aski <- data.frame(x = c("a","b","c","a","d","d"),
y = c("b","a","d","a","b","c"),
stringsAsFactors = FALSE)

How can I insert values into a data frame dynamically using R

After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
vec records which subcategories were scored, and each "stop" marks the end of one reviewer's ratings. Expected output: a data frame with one row per reviewer and one column per subcategory (a, b, c), with NA wherever a subcategory was not rated.
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.
@alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
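The same reviewer and subcategory indices can also be used to fill a preallocated matrix through matrix indexing. This is a minimal base R sketch rather than part of the answer above; the names reviewer, category, cats and out are introduced here for illustration, and it assumes every reviewer block ends with "stop".

```r
vec <- c("a", "b", "c", "stop", "a", "b", "stop", "stop", "c", "stop")
ratings <- c(2, 5, 1, 1, 3, 2)

stops <- vec == "stop"

# Reviewer number for each rating: count of "stop"s seen so far, plus one
reviewer <- cumsum(stops)[!stops] + 1L
# Subcategory for each rating
category <- vec[!stops]
cats <- sort(unique(category))

# One row per reviewer (one per "stop"), one column per subcategory, all NA at first
out <- matrix(NA_real_, nrow = sum(stops), ncol = length(cats),
              dimnames = list(NULL, cats))

# A two-column index matrix addresses individual (row, column) cells
out[cbind(reviewer, match(category, cats))] <- ratings

as.data.frame(out)
```

Reviewers who rated nothing keep a row of NA values because the matrix is sized from the number of "stop" markers.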
base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec)!="stop"]
row = 1
df = data.frame(lapply(categories, function(x){NA_integer_}))
colnames(df) <- categories
rating = 1
for(i in vec) {
if(i=='stop') {row <- row+1
} else { df[row,i] <- ratings[[rating]]; rating <- rating+1}
}
Here is one option
library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1,
-1)))))[value != 'stop', ratings := ratings
][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
L1 ~value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.

Merging columns with overlapping data in R data frames

a<-data.frame(cbind("Sample"=c("100","101","102","103"),"Status"=c("Y","","","partial")))
b<-data.frame(cbind("Sample"=c("100","101","102","103","106"),"Status"=c("NA","Y","","","Y")))
desired<-data.frame(cbind("Sample"=c("100","101","102","103","106"),"Status"=c("Y","Y","","partial","Y")))
I have sample processing data in multiple sources and I'd like to combine them into a master list. How can I merge the "Status" column between 2 data frames such that a overrules b in order to collate "Y" and "partial" for each sample? Thank you in advance.
require(data.table)
a<-data.table(cbind("Sample"=c("100","101","102","103"),"Status"=c("Y","","","partial")))
b<-data.table("Sample"=c("100","101","102","103","106"),"Status"=c("NA","Y","","","Y"))
c <- merge(a, b, by = "Sample", all=TRUE)
c[,Status := ifelse(!is.na(Status.x) & Status.x != "", Status.x, Status.y)] # use b only where a is missing or empty
c[,`:=` (Status.x=NULL, Status.y = NULL)]
I assume you want to keep the values from a and b in order of priority: Y overrides partial, which overrides NA, which overrides an empty value.
d <- merge(a,b,by="Sample",all=TRUE)
d$Status <- ""
d$Status[apply(d,1,function(x){any(is.na(x))})] <- "" # cleaning the NAs I introduced with the merge
d$Status[apply(d,1,`%in%`, x = "NA")] <- NA # or "NA" if you want to keep it this way, or "" if you want to get rid of them
d$Status[apply(d,1,`%in%`, x = "partial")] <- "partial"
d$Status[apply(d,1,`%in%`, x = "Y")] <- "Y"
d <- d[,c(1,4)]
# Sample Status
# 1 100 Y
# 2 101 Y
# 3 102
# 4 103 partial
# 5 106 Y
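For comparison, the "a overrules b" rule from the question can be sketched in base R with a single ifelse. The assumptions here are that an empty string or a missing row in a means "no information", and that the literal string "NA" should end up empty; the names m, use_a and res are hypothetical.

```r
a <- data.frame(Sample = c("100", "101", "102", "103"),
                Status = c("Y", "", "", "partial"),
                stringsAsFactors = FALSE)
b <- data.frame(Sample = c("100", "101", "102", "103", "106"),
                Status = c("NA", "Y", "", "", "Y"),
                stringsAsFactors = FALSE)

# Full outer join; the suffixes mark which source each status came from
m <- merge(a, b, by = "Sample", all = TRUE, suffixes = c(".a", ".b"))

# Prefer a wherever it carries a value, otherwise fall back to b
use_a <- !is.na(m$Status.a) & m$Status.a != ""
m$Status <- ifelse(use_a, m$Status.a, m$Status.b)

# Assumption: treat the literal string "NA" as empty
m$Status[m$Status %in% "NA"] <- ""

res <- m[, c("Sample", "Status")]
res
```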

rename column in dataframe using variable name R

I have a number of data frames. Each with the same format.
Like this:
A B C
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I would like to change the name of the third column, C, so that it includes part of the name of the variable associated with the data frame.
For the variable df_elephant the data frame should look like this:
A B C.elephant
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I have a function which will change the column name:
rename_columns <- function(x) {
colnames(x)[colnames(x)=='C'] <-
paste( 'C',
strsplit (deparse (substitute(x)), '_')[[1]][2], sep='.' )
return(x)
}
This works with my data frames. However, I would like to provide a list of data frames so that I do not have to call the function multiple times by hand. If I use lapply like so:
lapply( list (df_elephant, df_horse), rename_columns )
The function renames the column with NA rather than the relevant portion of the variable name.
[[1]]
A B C.NA
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.02344361
[[2]]
A B C.NA
1 0.45387054 0.02279488 1.6746280
2 -1.47271378 0.68660595 -0.2505752
3 1.26475917 -1.51739927 -1.3050531
Is there some way that I can provide a list of data frames to my function and produce the desired result?
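Inside lapply, your function never sees the variable name df_elephant: deparse(substitute(x)) returns "X[[i]]" there, so splitting on "_" yields no second element and you get NA. Instead, put the data frames in a named list and process the names of the list.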
# Generating random data
n = 3
item1 = data.frame(A = runif(n), B = runif(n), C = runif(n))
item2 = data.frame(A = runif(n), B = runif(n), C = runif(n))
myList = list(df_elephant = item1, df_horse = item2)
# 1- Why your code doesn't work: ---------------
names(myList) # This returns the names you want to use: [1] "df_elephant" "df_horse"
lapply(myList, names) # This returns each data frame's column names, which do not contain the animal name
# 2- How to make it work: ---------------
lapply(seq_along(myList), # seq_along() returns a vector of indices
function(i){
dfName = names(myList)[i] # Get the list name
dfName.animal = unlist(strsplit(dfName, "_"))[2] # Split on underscore and take the second element
df = myList[[i]] # Copy the actual Data frame
colnames(df)[colnames(df) == "C"] = paste("C", dfName.animal, sep = ".") # Change column names
return(df) # Return the new df
})
# [[1]]
# A B C.elephant
# 1 0.8289368 0.06589051 0.2929881
# 2 0.2362753 0.55689663 0.4854670
# 3 0.7264990 0.68069346 0.2940342
#
# [[2]]
# A B C.horse
# 1 0.08032856 0.4137106 0.6378605
# 2 0.35671556 0.8112511 0.4321704
# 3 0.07306260 0.6850093 0.2510791
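To see why the original function produces C.NA under lapply, it helps to look at what deparse(substitute(x)) returns in each calling context. The helper show_arg below is hypothetical and exists only for this illustration.

```r
# Returns the expression that was passed as the argument, as a string
show_arg <- function(x) deparse(substitute(x))

df_elephant <- data.frame(C = 1)

# Called directly, the argument expression is the variable name
direct <- show_arg(df_elephant)

# Called through lapply, the argument expression is lapply's internal X[[i]]
via_lapply <- lapply(list(df_elephant), show_arg)[[1]]

# Splitting a string without "_" gives one piece, so the second piece is NA
suffix <- strsplit(via_lapply, "_")[[1]][2]

direct
via_lapply
suffix
```

This is why the solutions in this thread pass the names explicitly, either through a named list or through Map, instead of recovering them from the call.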
You can also try the following. It is somewhat similar to akrun's answer and also uses Map at the end:
# Your data
d <- read.table("clipboard")
# create a list with names A and B
d_list <- list(A=d, B=d)
# function
foo <- function(x, y){
gr <- which(colnames(x) == "C") # get index of colnames C
tmp <- colnames(x) #new colnames vector
tmp[gr] <- paste(tmp[gr], y, sep=".") # replace the old with the new colnames.
setNames(x, tmp) # set the new names
}
# Result
Map(foo, d_list, names(d_list))
$A
A B C.A
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
$B
A B C.B
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
We can try Map. First collect the datasets in a list (here mget returns the objects named by the strings as a list). Then, with Map, rename the third column of each data frame using the corresponding element of the vector of names.
Map(function(x, y) {names(x)[3] <- paste(names(x)[3], sub(".*_", "", y), sep="."); x},
mget(c("df_elephant", "df_horse")), c("df_elephant", "df_horse"))
#$df_elephant
# A B C.elephant
#1 -0.02299388 0.7140416 0.8492423
#2 -1.43027866 -1.9642077 -1.2886368
#3 -1.01827712 -0.9414119 -2.0234436
#$df_horse
# A B C.horse
#1 0.4538705 0.02279488 1.6746280
#2 -1.4727138 0.68660595 -0.2505752
#3 1.2647592 -1.51739927 -1.3050531

Resources