I need to convert column values into column names using R.
For example, to convert format1 into format2:
var<-c("Id", "Name", "Score", "Id", "Score", "Id", "Name")
num<-c(1, "Tom", 4, 2, 7, 3, "Jim")
format1<-data.frame(var, num)
format1
var num
1 Id 1
2 Name Tom
3 Score 4
4 Id 2
5 Score 7
6 Id 3
7 Name Jim
Be careful: there are missing values in format1, and that's the challenge, I guess.
Id<-c(1, 2, 3)
Name<-c("Tom", NA, "Jim")
Score<-c(4, 7, NA)
format2<-data.frame(Id, Name, Score)
format2
Id Name Score
1 1 Tom 4
2 2 <NA> 7
3 3 Jim NA
# How to convert format1 into format2?
I may not have articulated this precisely; however, you can refer to the toy data I give above.
I know a little bit about reshape and reshape2; however, I failed to convert the data format using either of them.
format1$ID <- cumsum(format1$var == "Id")
format2 <- reshape(format1, idvar = "ID",timevar = "var", direction = "wide")[-1]
names(format2) <- gsub("num.", "", names(format2))
# Id Name Score
# 1 1 Tom 4
# 4 2 <NA> 7
# 6 3 Jim <NA>
Alternatively, if you'd like to skip the gsub() step, you could directly specify the output column names via the varying argument:
reshape(format1, idvar = "ID",timevar = "var", direction = "wide",
varying = list(c("Id", "Name", "Score")))[-1]
You can use dcast after adding an identifier column.
format1$pk <- cumsum( format1$var=="Id" )
library(reshape2)
dcast( format1, pk ~ var, value.var="num" )
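Both answers above rely on the same trick: cumsum(format1$var == "Id") gives every row the number of the record it belongs to. Here is a minimal base R sketch of that idea, assuming each record starts with an "Id" row; the names rec, keys and wide are only illustrative and do not come from the answers.

```r
# Toy data from the question; stringsAsFactors = FALSE guards against
# factor columns in R versions before 4.0.
var <- c("Id", "Name", "Score", "Id", "Score", "Id", "Name")
num <- c(1, "Tom", 4, 2, 7, 3, "Jim")
format1 <- data.frame(var, num, stringsAsFactors = FALSE)

# Each "Id" row opens a new record, so a running count of "Id" rows
# labels every row with the record it belongs to.
rec <- cumsum(format1$var == "Id")

# Start from an all-NA matrix and fill it by (record, column) position,
# so fields missing from a record simply stay NA.
keys <- unique(format1$var)
wide <- matrix(NA_character_, nrow = max(rec), ncol = length(keys),
               dimnames = list(NULL, keys))
wide[cbind(rec, match(format1$var, keys))] <- format1$num
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
wide
```

All columns come out as character here because num is a character vector; something like type.convert() could be applied afterwards if numeric columns are needed.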
I have a large dataset in the form shown below:
ID  Scores
1   English 3, French 7, Geography 8
2   Spanish 7, Classics 4
3   Physics 5, English 5, PE 7, Art 4
I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:
ID  English  French  Geography  Spanish  Classics  Physics  PE  Art
1   3        7       8          -        -         -        -   -
2   -        -       -          7        4         -        -   -
3   5        -       -          -        -         5        7   4
I cannot manually predefine the columns as there are 100s in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:
df$scores2 <- str_to_lower(df$Scores)
split <- separate(
df,
scores2,
into = paste0("Subject", 1:8),
sep = "\\,",
remove = FALSE,
convert = FALSE,
extra = "warn",
fill = "warn"
)
I have looked at multiple questions on the subject, such as "Split irregular text column into multiple columns in r", but I cannot find another case where the column titles and data values are mixed in the text string. How can I generate the full set of columns required and then populate the data values?
You can first split the Scores column into subject-score pairs and spread them over rows; separate_rows does both at once, replacing an explicit strsplit followed by unnest. Then separate each subject-score pair into Subject and Score columns. Finally, transform the data from a "long" format to a "wide" format.
Thanks @G. Grothendieck for improving my code :)
library(tidyverse)
df %>%
separate_rows(Scores, sep = ", ") %>%
separate(Scores, sep = " ", into = c("Subject", "Score")) %>%
pivot_wider(names_from = "Subject", values_from = "Score")
# A tibble: 3 × 9
ID English French Geography Spanish Classics Physics PE Art
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 3 7 8 NA NA NA NA NA
2 2 NA NA NA 7 4 NA NA NA
3 3 5 NA NA NA NA 5 7 4
Using data.table
library(data.table)
library(stringr)  # provides str_split()
setDT(dt)
dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")
Results
# ID Art Classics English French Geography PE Physics Spanish
# 1: 1 <NA> <NA> 3 7 8 <NA> <NA> <NA>
# 2: 2 <NA> 4 <NA> <NA> <NA> <NA> <NA> 7
# 3: 3 4 <NA> 5 <NA> <NA> 7 5 <NA>
Data
dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8",
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA,
-3L), class = c("data.frame"))
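For comparison, the same split-then-widen idea can be sketched in base R without extra packages. This is only an illustration under two assumptions: pairs are separated by ", " and the score is always the last space-separated token. The names pairs, long and wide are made up for the sketch.

```r
df <- data.frame(
  ID = 1:3,
  Scores = c("English 3, French 7, Geography 8",
             "Spanish 7, Classics 4",
             "Physics 5, English 5, PE 7, Art 4"),
  stringsAsFactors = FALSE
)

# One list element per ID, one string per subject-score pair.
pairs <- strsplit(df$Scores, ", ", fixed = TRUE)

# Long format: repeat each ID once per pair it owns.
long <- data.frame(
  ID   = rep(df$ID, lengths(pairs)),
  pair = unlist(pairs),
  stringsAsFactors = FALSE
)

# Split at the LAST space, so a multi-word subject would stay in one piece.
long$Subject <- sub(" [^ ]*$", "", long$pair)
long$Score   <- as.integer(sub("^.* ", "", long$pair))

# Wide format: an ID x Subject matrix; combinations that never occur are NA.
wide <- tapply(long$Score, list(long$ID, long$Subject), function(x) x[1])
wide
```

Note that tapply() sorts the subject columns alphabetically, and that function(x) x[1] silently keeps only the first score if a subject appears twice for the same ID.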
Say we have a data frame which looks something like
df <- data.frame(x_A = c(1, 2), x_B = c(3, 4), y_A = c(5, 6), y_B = c(7, 8))
df
x_A x_B y_A y_B
1 1 3 5 7
2 2 4 6 8
Using library(tidyr), I'm wondering why passing names_sep = "_" in pivot_longer yields a different result than names_sep = 2, as seen in the following:
pivot_longer(df, x_A:y_B, names_to = c("name1", "name2"), names_sep = "_")
# A tibble: 8 x 3
name1 name2 value
<chr> <chr> <dbl>
1 x A 1
2 x B 3
3 y A 5
4 y B 7
5 x A 2
6 x B 4
7 y A 6
8 y B 8
pivot_longer(df, x_A:y_B, names_to = c("name1", "name2"), names_sep = 2)
# A tibble: 8 x 3
name1 name2 value
<chr> <chr> <dbl>
1 x_ A 1
2 x_ B 3
3 y_ A 5
4 y_ B 7
5 x_ A 2
6 x_ B 4
7 y_ A 6
8 y_ B 8
When passing a string with the character to break on, that character itself is dropped. When passing the index of the character, it is not. Could someone explain why there is a difference?
From the online doc: "names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on)".
Note the difference in wording "split" for regular expression and "position to break" for numerics. So with a regular expression, the split string is taken as a word separator (and not included in the output column names). With a numeric, there is no "separator" and all characters in the original column name appear in the output.
As @KonradRudolph says, this is intuitive. If you wanted the separator to appear in the output when using a regex, you would have an ambiguity: does the separator get associated with the "name" to its left or to its right? You can only resolve that by convention (which is sub-optimal) or with an additional parameter (which, to me, is unnecessary and overly complicated).
Different results, but documented and intentional. I wouldn't describe the behaviour as inconsistent.
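The distinction between "splitting on a pattern" and "breaking at a position" can be reproduced with base R string functions. The sketch below is only an analogy on the column names from the question; it is not how tidyr implements names_sep.

```r
nm <- c("x_A", "x_B", "y_A", "y_B")

# Splitting on a pattern: the matched text is the delimiter and is consumed.
by_pattern <- do.call(rbind, strsplit(nm, "_", fixed = TRUE))

# Breaking at a position: the string is cut after character 2 and every
# character ends up in one of the two pieces.
by_position <- cbind(substr(nm, 1, 2), substr(nm, 3, nchar(nm)))

by_pattern[1, ]   # "x"  "A"
by_position[1, ]  # "x_" "A"
```

With a position there is nothing that plays the role of a delimiter, so nothing can be dropped.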
Suppose I have a DT like this:
id values valid_types
1 2|3 100|200
2 4 200
3 2|1 500|100
The valid_types column tells me which types are valid for each entry. There are 4 types in total (100, 200, 500, 2000). Each entry lists its valid types and the corresponding values as |-separated character strings.
I want to transform this to a DT which has the types as columns and their corresponding values.
Expected:
id 100 200 500
1 2 3 NA
2 NA 4 NA
3 1 NA 2
I thought I could take both the columns and split them on | which would give me two lists. I would then combine them by setting the keys as names of the types list and then convert the final list to a DT.
But the idea I came up with is very convoluted and not really working.
Is there a better/easier way to do this?
Here is another data.table approach:
dcast(
DT[, lapply(.SD, function(x) strsplit(x, "\\|")[[1L]]), by = id],
id ~ valid_types, value.var = "values"
)
Using tidyr library you can use separate_rows with pivot_wider :
library(tidyr)
df %>%
separate_rows(values, valid_types, sep = '\\|', convert = TRUE) %>%
pivot_wider(names_from = valid_types, values_from = values)
# id `100` `200` `500`
# <int> <int> <int> <int>
#1 1 2 3 NA
#2 2 NA 4 NA
#3 3 1 NA 2
A data.table way would be :
library(data.table)
library(splitstackshape)
setDT(df)
dcast(cSplit(df, c('values', 'valid_types'), sep = '|', direction = 'long'),
id~valid_types, value.var = 'values')
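The idea from the question (split both columns on | and pair the pieces up) also works in plain base R. A minimal sketch, assuming every row has as many values as valid types; the names vals, types, long and wide are illustrative only.

```r
DT <- data.frame(
  id          = 1:3,
  values      = c("2|3", "4", "2|1"),
  valid_types = c("100|200", "200", "500|100"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "|" literally, so no regex escaping is needed.
vals  <- strsplit(DT$values, "|", fixed = TRUE)
types <- strsplit(DT$valid_types, "|", fixed = TRUE)

# The pieces are paired by position, so both splits must line up.
stopifnot(identical(lengths(vals), lengths(types)))

long <- data.frame(
  id    = rep(DT$id, lengths(vals)),
  type  = unlist(types),
  value = as.integer(unlist(vals)),
  stringsAsFactors = FALSE
)

# id x type matrix; types an id does not have are left as NA.
wide <- tapply(long$value, list(long$id, long$type), function(x) x[1])
wide
```

Type 2000 never occurs in the toy data, so it gets no column; that matches the expected output in the question.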
I have a data frame that looks like this:
head(df)
shotchart
1 BMMMBMMBMMBM
2 MMMBBMMBBMMB
3 BBBBMMBMMMBB
4 MMMMBBMMBBMM
Different patterns of the letter 'M' are worth certain values such as the following:
MM = 1
MMM = 2
MMMM = 3
I want to create an extra column to this data frame that calculates the total value of the different patterns of 'M' in each row individually.
For example:
head(df)
shotchart score
1 BMMMBMMBMMBM 4
2 MMMBBMMBBMMB 4
3 BBBBMMBMMMBB 3
4 MMMMBBMMBBMM 5
I can't seem to figure out how to assign the values to the different 'M' patterns.
I tried using the following code but it didn't work:
df$score <- revalue(df$scorechart, c("MM"="1", "MMM"="2", "MMMM"="3"))
We create a named vector ('nm1') that maps each run of 'M' to its value, split 'shotchart' so that only the runs of 'M' remain, and then look up each run in the named vector and sum the values:
nm1 <- setNames(1:3, strrep("M", 2:4))
sapply(strsplit(gsub("[^M]+", ",", df$shotchart), ","),
function(x) sum(nm1[x[nzchar(x)]], na.rm = TRUE))
Or using tidyverse
library(tidyverse)
df %>%
mutate(score = str_extract_all(shotchart, "M+") %>%
map_dbl(~ nm1[.x] %>%
sum(., na.rm = TRUE)))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
You can also split on "B" and base the result on the count of "M" characters minus 1, as follows:
df <- data.frame(shotchart = c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM"),
score = NA_integer_,
stringsAsFactors = FALSE)
# sapply() returns a plain numeric vector; lapply() would store a list in the column
df$score <- sapply(strsplit(df$shotchart, "B"), function(i) sum((nchar(i)-1)[(nchar(i)-1)>0]))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
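The "length of the run minus 1" rule can also be written by extracting the runs of "M" directly. This is a sketch under the assumption that a run of n "M"s is worth n - 1 for any n, which agrees with the three patterns given in the question but extrapolates beyond them; runs and score are illustrative names.

```r
shotchart <- c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM")

# All maximal runs of "M" in each string, as a list of character vectors.
runs <- regmatches(shotchart, gregexpr("M+", shotchart))

# A run of length n scores n - 1, so a lone "M" contributes 0.
score <- vapply(runs, function(r) sum(nchar(r) - 1L), integer(1))
score
```

Unlike the lookup-table approach, this version would also score a run of five or more "M"s rather than ignoring it, which may or may not be what is wanted.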
My data looks like below:
df <- structure(list(V1 = structure(c(7L, 4L, 8L, 8L, 5L, 3L, 1L, 1L,
2L, 1L, 6L), .Label = c("", "cell and biogenesis;transport",
"differentiation;metabolic process;regulation;stimulus", "MAPK cascade;cell and biogenesis",
"MAPK cascade;cell and biogenesis;transport", "metabolic process;regulation;stimulus;transport",
"mRNA;stimulus;transport", "targeting"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
I want to count how many times each string occurs, and also keep track of which rows it comes from. The strings within a row are separated by a ; and each one belongs to the row it appears in.
I want to have the output like this:
String Count position
mRNA 1 1
stimulus 3 1,6,11
transport 4 1,5,9,11
MAPK cascade 2 2,5
cell and biogenesis 3 2,5,9
targeting 2 3,4
regulation of mRNA stability 1 1
regulation 2 6,11
differentiation 1 6,11
metabolic process 2 6,11
The count shows how many times each string (the strings are separated by semicolons) occurs in the entire data.
The position column shows which rows it was in; for example, mRNA was only in the first row, so it is 1, while stimulus was in three rows: 1, 6 and 11.
Some rows are blank, and they also count as rows.
In the code below we do the following:
Add the row numbers as a column.
Use strsplit to split each string into its components and store the result in a column called string.
strsplit returns a list. We use unnest to stack the list components to create a "long" data frame, giving us a "tidy" data frame that's ready to summarize.
Group by string and return a new data frame that counts the frequency of each string and gives the original row number in which each instance of the string originally appeared.
library(tidyverse)
df$V1 = as.character(df$V1)
df %>%
rownames_to_column() %>%
mutate(string = strsplit(V1, ";")) %>%
unnest(string) %>%
group_by(string) %>%
summarise(count = n(),
rows = paste(rowname, collapse=","))
string count rows
1 cell and biogenesis 3 2,5,9
2 differentiation 1 6
3 MAPK cascade 2 2,5
4 metabolic process 2 6,11
5 mRNA 1 1
6 regulation 2 6,11
7 stimulus 3 1,6,11
8 targeting 2 3,4
9 transport 4 1,5,9,11
If you plan to do further processing on the row numbers, you might want to keep them as numeric values, rather than as a string of pasted values. In that case, you could do this:
df.new = df %>%
rownames_to_column("rows") %>%
mutate(string = strsplit(V1, ";")) %>%
select(-V1) %>%
unnest(string)
This will give you a long data frame with one row for each combination of string and rows.
A base R approach:
# convert 'V1' to a character vector (only necessary if it isn't already)
df$V1 <- as.character(df$V1)
# get the unique strings
strng <- unique(unlist(strsplit(df$V1,';')))
# create a list with the rows for each unique string
lst <- lapply(strng, function(x) grep(x, df$V1, fixed = TRUE))
# get the counts for each string
count <- lengths(lst)
# collapse the list of string positions into a string with the row numbers for each string
pos <- sapply(lst, toString)
# put everything together in one dataframe
d <- data.frame(strng, count, pos)
You can shorten this approach to:
d <- data.frame(strng = unique(unlist(strsplit(df$V1,';'))))
lst <- lapply(d$strng, function(x) grep(x, df$V1, fixed = TRUE))
transform(d, count = lengths(lst), pos = sapply(lst, toString))
The result:
> d
strng count pos
1 mRNA 1 1
2 stimulus 3 1, 6, 11
3 transport 4 1, 5, 9, 11
4 MAPK cascade 2 2, 5
5 cell and biogenesis 3 2, 5, 9
6 targeting 2 3, 4
7 differentiation 1 6
8 metabolic process 2 6, 11
9 regulation 2 6, 11
A possible data.table solution for completeness
library(data.table)
setDT(df)[, .(.I, unlist(tstrsplit(V1, ";", fixed = TRUE)))
][!is.na(V2), .(count = .N, pos = toString(sort(I))),
by = .(String = V2)]
# String count pos
# 1: mRNA 1 1
# 2: MAPK cascade 2 2, 5
# 3: targeting 2 3, 4
# 4: differentiation 1 6
# 5: cell and biogenesis 3 2, 5, 9
# 6: metabolic process 2 6, 11
# 7: stimulus 3 1, 6, 11
# 8: transport 4 1, 5, 9, 11
# 9: regulation 2 6, 11
This basically splits the V1 column by ; and converts it to a long format while binding it with the row index (.I). Afterwards it's just a simple aggregation: a row count (.N) and the positions collapsed into a single string per String.
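The same "split to long, then aggregate" pattern can be sketched in base R with split(). Unlike the grep()-based approach, it compares whole strings, so one term that happens to be a substring of another would not be counted twice. The vector V1 reproduces the question's data as plain character strings; parts, long, pos and result are illustrative names.

```r
V1 <- c("mRNA;stimulus;transport",
        "MAPK cascade;cell and biogenesis",
        "targeting",
        "targeting",
        "MAPK cascade;cell and biogenesis;transport",
        "differentiation;metabolic process;regulation;stimulus",
        "", "",
        "cell and biogenesis;transport",
        "",
        "metabolic process;regulation;stimulus;transport")

parts <- strsplit(V1, ";", fixed = TRUE)

# Long format: one row per (original row number, string) pair.
long <- data.frame(
  row    = rep(seq_along(parts), lengths(parts)),
  string = unlist(parts),
  stringsAsFactors = FALSE
)
long <- long[nzchar(long$string), ]  # blank rows contribute no strings

# Group the row numbers by string, then summarise each group.
pos <- split(long$row, long$string)
result <- data.frame(
  string = names(pos),
  count  = lengths(pos),
  pos    = vapply(pos, toString, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
result
```

Keeping pos as a list of integer vectors is handy if the row numbers are needed for further processing rather than just for display.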