Unlist column to create unique row in dataframe - r

I am faced with the following R transformation issue.
I have the following dataframe:
test_df <- structure(list(word = c("list of XYZ schools",
"list of basketball", "list of usa"), results = c("58", "151", "29"), key_list = structure(list(`coRq,coG,coQ,co7E,coV98` = c("coRq", "coG", "coQ", "co7E", "coV98"), `coV98,coUD,coHF,cobK,con7` = c("coV98","coUD", "coHF", "cobK", "con7"), `coV98,coX7,couC,coD3,copW` = c("coV98", "coX7", "couC", "coD3", "copW")), .Names = c("coRq,coG,coQ,co7E,coV98", "coV98,coUD,coHF,cobK,con7", "coV98,coX7,couC,coD3,copW"))), .Names = c("word", "results", "key_list"), row.names = c(116L, 150L, 277L), class = "data.frame")
In short there are three columns, unique on "word" and then a corresponding "key_list" that has a list of keys comma separated. I am interested in creating a new data frame where each key is unique and the word information is duplicated as well as the result information.
So a dataframe that looks as follows:
key word results
coV98 "list of XYZ schools" 58
coRq "list of XYZ schools" 58
coV98 "list of basketball" 151
coV98 "list of usa" 29
And so on for all the keys. In other words, I would like to unlist the keys and then reshape the result into a dataframe in which the word and the other columns repeat for each key.
Here is what I have tried so far:
I created a unique list of keys, then tried to grep for each key in the column, looping through to build a smaller dataframe per key and rbind those together. The resulting dataframe, however, does not contain the key column:
keys <- as.data.frame(table(unname(unlist(test_df$key_list))))
ttt <- lapply(keys, function(xx){
idx <- grep(xx, test_df$key_list)
df <- all_data_sub[idx,]})
final_df <- do.call(rbind, ttt)
I have also played around with unlisting and reshaping, but I am not getting the right combination.
Any advice would be great!
thanks

Maybe we can use listCol_l from splitstackshape:
library(splitstackshape)
listCol_l(test_df, 'key_list')[]

In case a base R solution is helpful for someone:
do.call(rbind, lapply(seq_along(test_df$key_list), function(i) {
merge(test_df$key_list[[i]], test_df[i,-3], by=NULL)
}))
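The same idea (repeat each row once per key, then attach the flattened keys) can also be sketched without merge. This is only an illustration: toy_df is a cut-down stand-in for test_df, and the names n_keys, row_idx and long_df are made up for the example.

```r
# Small stand-in for test_df: one list column holding the keys for each word.
toy_df <- data.frame(word = c("list of basketball", "list of usa"),
                     results = c("151", "29"),
                     stringsAsFactors = FALSE)
toy_df$key_list <- list(c("coV98", "coUD", "coHF"), c("coV98", "coX7"))

# Number of keys per row, and the row index repeated once per key.
n_keys <- lengths(toy_df$key_list)
row_idx <- rep(seq_len(nrow(toy_df)), times = n_keys)

# Flatten the keys and bind them to the repeated word/results rows.
long_df <- data.frame(key = unlist(toy_df$key_list, use.names = FALSE),
                      toy_df[row_idx, c("word", "results")],
                      row.names = NULL,
                      stringsAsFactors = FALSE)
long_df
```

Because the row index is built from lengths(), rows with different numbers of keys are handled the same way.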

Related

How to convert a dataframe in long format into a list of an appropriate format?

I have a dataframe in the following long format:
I need to convert it into a list which should look something like this:
Wherein, each of the main element of the list would be the "Instance No." and its sub-elements should contain all its corresponding Parameter & Value pairs - in the format of "Parameter X" = "abc" as you can see in the second picture, listed one after the other.
Is there any existing function which can do this? I wasn't really able to find any. Any help would be really appreciated.
Thank you.
A dplyr solution
require(dplyr)
df_original <- data.frame("Instance No." = c(3,3,3,3,5,5,5,2,2,2,2),
"Parameter" = c("age", "workclass", "education", "occupation",
"age", "workclass", "education",
"age", "workclass", "education", "income"),
"Value" = c("Senior", "Private", "HS-grad", "Sales",
"Middle-aged", "Gov", "Hs-grad",
"Middle-aged", "Private", "Masters", "Large"),
check.names = FALSE)
# the split function requires a factor to use as the grouping variable.
# Param_Value will be the properly formatted vector
df_modified <- mutate(df_original,
Param_Value = paste0(Parameter, "=", Value))
# drop the parameter and value columns now that the data is contained in Param_Value
df_modified <- select(df_modified,
`Instance No.`,
Param_Value)
# there is now a list containing dataframes with rows grouped by Instance No.
list_format <- split(df_modified,
df_modified$`Instance No.`)
# The Instance No. is still in each dataframe. Loop through each and strip the column.
list_simplified <- lapply(list_format,
select, -`Instance No.`)
# unlist the remaining Param_Value column and drop the names.
list_out <- lapply(list_simplified ,
unlist, use.names = FALSE)
There should now be a list of vectors formatted as requested.
$`2`
[1] "age=Middle-aged" "workclass=Private" "education=Masters" "income=Large"
$`3`
[1] "age=Senior" "workclass=Private" "education=HS-grad" "occupation=Sales"
$`5`
[1] "age=Middle-aged" "workclass=Gov" "education=Hs-grad"
The posted data.table solution is faster, but I think this is a bit more understandable.
require(data.table)
your_dt <- data.table(your_df)
dt_long <- melt.data.table(your_dt, id.vars='Instance No.')
class(dt_long) # for debugging
dt_long[, strVal:=paste(variable,value, sep = '=')]
result_list <- list()
for (i in unique(dt_long[['Instance No.']])){
result_list[[as.character(i)]] <- dt_long[`Instance No.`==i, strVal]
}
Just for reference, here is a base R one-liner to do this, where df is your dataframe:
l <- lapply(split(df, df[["Instance No."]]),
function(x) paste0(x$Parameter, "=", x$Value))
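A slightly different base R variant, sketched on a cut-down copy of the example data: build the "Parameter=Value" strings first, then split that vector by instance number. The names param_value and result are assumptions made for this illustration.

```r
# Cut-down copy of the example data (two instances only).
df <- data.frame("Instance No." = c(3, 3, 5, 5, 5),
                 Parameter = c("age", "workclass", "age", "workclass", "education"),
                 Value = c("Senior", "Private", "Middle-aged", "Gov", "Hs-grad"),
                 check.names = FALSE,
                 stringsAsFactors = FALSE)

# Paste the pairs once, then group the resulting vector by instance.
param_value <- paste0(df$Parameter, "=", df$Value)
result <- split(param_value, df[["Instance No."]])
result
```

split() names the list elements after the instance numbers, so result[["5"]] returns the pairs for instance 5.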

How to change values in unnamed first column

How do I change the entries of the first column in the matrix returned by read_csv if it doesn't have a header?
My data currently look like this:
PostFC C1Mean
WBGene00001816 2.475268e-01 415.694457
WBGene00001817 4.808575e+00 2451.018711
and I'd like to rename WBGene0000XXXX to XXXX.
If the first column is actually the row names, do the following:
rownames(data) <- gsub(pattern = "WBGene0000", replacement = "", x = rownames(data))
If the prefix isn't consistent, consider the base substr function or str_sub from the stringr package.
But if it is actually a vector with no header column, I do not know how to reference it without knowing the structure of the data.
Run str() on the data set and see what it returns, or try the following as a test:
colnames(data)[1] <- "test"
It's hard to help further until we know how you ended up with a "zero-length" variable name.
If I understand your question correctly, the first "unnamed" column you describe holds row names and is not actually a column in your data.frame.
# Example data
df = data.frame(PostFC = c(2.475268e-01, 4.808575e+00), C1Mean = c(415.694457, 2451.018711) )
rownames(df) = c("WBGene00001816", "WBGene00001817")
df
# PostFC C1Mean
# WBGene00001816 0.2475268 415.6945
# WBGene00001817 4.8085750 2451.0187
# change rownames
rownames(df) = c("rowname1", "rowname2")
df
# PostFC C1Mean
# rowname1 0.2475268 415.6945
# rowname2 4.8085750 2451.0187
The entries addressed are actually row names. We can access them with rownames(.).
rownames(df1)
# [1] "WBGene00001816" "WBGene00001817" "WBGene00001818" "WBGene00001819"
# [5] "WBGene00001820" "WBGene00001821" "WBGene00001822"
R also provides the replacement function rownames<-, i.e. we can assign new row names with rownames(.) <- c(.).
In your case it looks as if you want to keep just the last four digits. We can use substring here, telling it the position from which to extract. In our case that is from the 11th character to the end, so we do:
rownames(df1) <- substring(rownames(df1), 11)
df1
# PostFC C1Mean
# 1816 0.36250598 2.1073145
# 1817 0.51068402 0.4186838
# 1818 -0.96837330 -0.7239156
# 1819 0.02331745 -0.5902216
# 1820 -0.56927945 1.7540356
# 1821 -0.51252943 0.1343385
# 1822 0.47263180 1.4366233
Note that duplicated row names are not allowed, so if this method produces duplicates it will yield an error.
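To illustrate the duplicate problem, here is a small sketch with two made-up IDs that share their last four digits. The IDs, the toy data frame and the use of make.unique as a workaround are assumptions for this example, not part of the original data.

```r
# Two hypothetical IDs whose last four digits collide.
ids <- c("WBGene00001816", "WBGene00011816")
short <- substring(ids, 11)   # both become "1816"

toy <- data.frame(PostFC = c(0.25, 4.81))

# Assigning duplicated row names to a data.frame fails; capture the outcome.
attempt <- tryCatch({
  rownames(toy) <- short
  "ok"
}, error = function(e) "error")
attempt

# One possible workaround: make.unique() appends .1, .2, ... to repeats.
rownames(toy) <- make.unique(short)
rownames(toy)
```

Whether disambiguated names such as "1816.1" are acceptable depends on how the row names are used later.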
Data used
df1 <- structure(list(PostFC = c(0.362505982864934, 0.510684020059692,
-0.968373302351162, 0.0233174467410604, -0.56927945273647, -0.512529427359891,
0.472631804850333), C1Mean = c(2.10731450148575, 0.418683823183885,
-0.723915648073638, -0.590221641040516, 1.75403562218217, 0.134338480077884,
1.43662329542089)), class = "data.frame", row.names = c("1816",
"1817", "1818", "1819", "1820", "1821", "1822"))

Subset strings in R

One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
Rest of the data in the vector follow same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)', '\\1', df$location1)
I was able to make it work by looking at other code online, but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full, I have been unable to use it to extract other parts of the same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting with regex, using the scan() function and too many other variations. Now I am all confused. Thx
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
sep = ", |\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
You can try strsplit or data.table::tstrsplit (strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
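For readers without data.table, roughly the same two-step split can be sketched with base strsplit. The second address below is made up to show a city name containing a space, and the names loc, parts, out and sz are assumptions for this illustration.

```r
# Example strings; the second one is a hypothetical address for illustration.
loc <- c("Potomac, MD 20854\n(39.038266, -77.203413)",
         "Silver Spring, MD 20910\n(38.9907, -77.0261)")

# Step 1: split on ", ", "\n(" or ")" and stack the pieces into a matrix.
parts <- do.call(rbind, strsplit(loc, ", |\\n\\(|\\)"))
out <- data.frame(city = parts[, 1],
                  state_zip = parts[, 2],
                  lat = as.numeric(parts[, 3]),
                  long = as.numeric(parts[, 4]),
                  stringsAsFactors = FALSE)

# Step 2: split "MD 20854" into state and zip on the single space.
sz <- do.call(rbind, strsplit(out$state_zip, " ", fixed = TRUE))
out$state <- sz[, 1]
out$zip <- sz[, 2]
out$state_zip <- NULL
out
```

This assumes every string yields the same number of pieces; rows that do not follow the pattern would need to be checked before calling rbind.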
Here is an option using base R
read.table(text= trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
So this process might be a little longer, but for me it makes things clear. As opposed to using breaks, below I identify values by using a specific regex for each value I want. I make a vector of regex to extract each value, a vector for the variable names, then use a loop to extract and create the dataframe from those vectors.
library(stringi)
library(dplyr)
library(purrr)
rgexVec <- c("[\\w\\s-]+(?=,)",
"[A-Z]{2}",
"\\d+(?=\\n)",
"[\\d-\\.]+(?=,)",
"[\\d-\\.]+(?=\\))")
varNames <- c("city",
"state",
"zip",
"lat",
"long")
# `value` is assumed to hold the input string, e.g. value <- df$location1[1]
map2_dfc(varNames, rgexVec, function(vn, rg) {
extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
names(extractedVal) <- vn
extractedVal %>% as_tibble()
})
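\\1 is a backreference in regex. It refers to the text matched by the first capture group, i.e. the first parenthesized part of the pattern, here (.*). Using it as the replacement in gsub therefore replaces the whole match with just the captured latitude/longitude text.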
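A short sketch of how capture groups and backreferences behave, using the example string from the question. The zip pattern and the variable names are assumptions added for illustration.

```r
x <- "Potomac, MD 20854\n(39.038266, -77.203413)"

# The pattern matches the whole string; "\\1" keeps only what the
# parentheses captured (the text between "(" and ")").
lat_long <- gsub(".*\\n\\((.*)\\)", "\\1", x)
lat_long

# Same idea for the zip code: capture the five digits before the newline.
zip <- gsub(".*\\s(\\d{5})\\n.*", "\\1", x)
zip

# With two groups, "\\1" and "\\2" refer to them in order of appearance.
swapped <- sub("^(.*), (.*)$", "\\2/\\1", "Potomac, MD")
swapped
```

The key point is that the replacement string can only refer to parts of the match that were wrapped in parentheses in the pattern.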

Initialize an empty tibble with column names and 0 rows

I have a vector of column names called tbl_colnames.
I would like to create a tibble with 0 rows and length(tbl_colnames) columns.
The best way I've found of doing this is...
tbl <- as_tibble(data.frame(matrix(nrow = 0, ncol = length(tbl_colnames))))
and then I want to name the columns, so...
colnames(tbl) <- tbl_colnames
My question: Is there a more elegant way of doing this?
something like tbl <- tibble(colnames=tbl_colnames)
my_tibble <- tibble(
var_name_1 = numeric(),
var_name_2 = numeric(),
var_name_3 = numeric(),
var_name_4 = numeric(),
var_name_5 = numeric()
)
Haven't tried, but I guess it works too if instead of initiating numeric vectors of length 0 you do it with other classes (for example, character()).
This SO question explains how to do it with other R libraries.
According to this tidyverse issue, this won't be a feature for tribbles.
Since you want to combine a list of tibbles, you can just assign NULL to the variable and then bind_rows with the other tibbles.
res = NULL
for(i in tibbleList)
res = bind_rows(res,i)
However, a much more efficient way to do this is
bind_rows(tibbleList) # combine all tibbles in the list
For anyone still interested in an elegant way to create a 0-row tibble with column names given by a character vector tbl_colnames:
tbl_colnames %>% purrr::map_dfc(setNames, object = list(logical()))
or:
tbl_colnames %>% purrr::map_dfc(~tibble::tibble(!!.x := logical()))
or:
tbl_colnames %>% rlang::rep_named(list(logical())) %>% tibble::as_tibble()
This, of course, results in each column being of type logical.
The following command will create a tibble with 0 row and variables (columns) named with the contents of tbl_colnames
tbl <- tibble::tibble(!!!tbl_colnames, .rows = 0)
You could abuse readr::read_csv, which allows reading from a string. You can control names and types, e.g.:
tbl_colnames <- c("one", "two", "three", "c4", "c5", "last")
read_csv("\n", col_names = tbl_colnames) # all character type
read_csv("\n", col_names = tbl_colnames, col_types = "lcniDT") # various types
I'm a bit late to the party, but for future readers:
as_tibble(matrix(nrow = 0, ncol = length(tbl_colnames)), .name_repair = ~ tbl_colnames)
.name_repair allows you to name your columns within the same function call.

Summing over rows containing particular strings in R

I have a dataframe where the first column contains names of campaigns. I need to sum up all rows where the campaign names contain certain strings (it can appear in different places within the name, i.e. sometimes in the beginning sometimes in the end). The dataframe looks something like this:
Campaign Impressions
1 Local display 1661246
2 Local text 1029724
3 National display 325832
4 National Audio 498900
5 Audio local 597339
6 TV Regional 597339
...
So in this case I want to sum up all rows containing "local" into one row, "national" into one, "regional" into one etc, like this:
Campaign Impressions
1 Local 939293929
2 National 9232423423
2 Regional 1123123123
How can this be achieved? I've been trying with ddply without success....
You could use grep to find the rows that match the Campaign column categories ('Local', 'National', 'Regional') in a loop (lapply). Subset the dataset ('df') based on grep and sum the 'Impressions' column and rbind the list elements.
res1 <- do.call(rbind,lapply(c('Local', 'National', 'Regional'),
function(x) {
x1 <- df[grep(x, df$Campaign, ignore.case=TRUE),]
data.frame(Campaign= x, Impressions=sum(x1$Impressions))}))
Or use data.table. Keep only the 'Local', 'National', 'Region' in the 'Category' using sub and use that as "grouping" variable to sum the column 'Impressions'.
library(data.table)
setDT(df)[, list(Impressions=sum(Impressions)),by=
list(Category=sub('.*?(Local|National|Region).*','\\U\\1', Campaign,
ignore.case=TRUE, perl=TRUE))]
data
df <- structure(list(Campaign = c("Local display", "Local text",
"National display",
"National Audio", "Audio local", "TV Regional"), Impressions =
c(1661246L, 1029724L, 325832L, 498900L, 597339L, 597339L)), .Names =
c("Campaign", "Impressions"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
I suggest using the grep function. Say your data.frame is called mydata; then (with ignore.case = TRUE so that "Audio local" is matched as well):
Local = grep(mydata$Campaign, pattern = "Local", ignore.case = TRUE)
National = grep(mydata$Campaign, pattern = "National", ignore.case = TRUE)
Regional = grep(mydata$Campaign, pattern = "Regional", ignore.case = TRUE)
mydata_sum = data.frame(Campaign = c("Local", "National", "Regional"), Impressions = c(sum(mydata$Impressions[Local]), sum(mydata$Impressions[National]), sum(mydata$Impressions[Regional])))
Here's my approach using dplyr:
library(dplyr)
library(stringr)
categories <- "Local|National|Regional"
mydf %>%
mutate(Campaign = tolower(str_extract(Campaign, regex(categories, ignore_case = TRUE)))) %>%
group_by(Campaign) %>%
summarise(sum(Impressions))
I needed to add the tolower, after extracting the strings, to make sure the group_by groups "local" together with "Local".
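The same extract, lowercase, then group idea can be sketched in base R with regexpr, regmatches and tapply, using the sample data from the question. The names m, category and totals are assumptions for this illustration, and the sketch assumes every campaign name contains one of the three categories (regmatches drops non-matching rows).

```r
df <- data.frame(Campaign = c("Local display", "Local text", "National display",
                              "National Audio", "Audio local", "TV Regional"),
                 Impressions = c(1661246L, 1029724L, 325832L,
                                 498900L, 597339L, 597339L),
                 stringsAsFactors = FALSE)

# Locate the first category keyword in each name, ignoring case.
m <- regexpr("local|national|regional", df$Campaign, ignore.case = TRUE)

# Extract the keyword and lowercase it so "Local" and "local" group together.
category <- tolower(regmatches(df$Campaign, m))

# Sum impressions per category.
totals <- tapply(df$Impressions, category, sum)
totals
```

Without the tolower step, "Audio local" would form its own group, separate from the "Local ..." rows.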
