Importing non-rectangular data as rectangular in R - r

I need to load social network data where each user has an unknown and potentially large number of friends, stored as a text file of the following format:
UserId: FriendId1, FriendId2, ...
1: 12, 33
2:
3: 4, 6, 10, 15, 16
into a two-column data.frame:
UserId FriendId
1 1 12
2 1 33
3 3 4
4 3 6
5 3 10
6 3 15
7 3 16
How would you do that in R?
Reading, filling and then reshaping is inefficient, as it requires keeping many columns full of NA in memory.

If you really have a colon as a delimiter, then just use read.table with header = FALSE to get your data into R, then consider using cSplit from my "splitstackshape" package.
mydf <- read.table("test.txt", sep = ":", header = FALSE)
mydf
## V1 V2
## 1 1 12, 33
## 2 2
## 3 3 4, 6, 10, 15, 16
library(splitstackshape)
cSplit(mydf, "V2", ",", "long")
## V1 V2
## 1: 1 12
## 2: 1 33
## 3: 3 4
## 4: 3 6
## 5: 3 10
## 6: 3 15
## 7: 3 16

This reads the lines, then parses them one by one into two-column matrices. It produces character values (since lines of text are just characters), but coercing to numeric is trivial:
rLines <- readLines("test.txt")
# lapply always returns a list, so rbind works whatever the row counts are
do.call(rbind, lapply(rLines, function(L) {
  n <- sub(":.*", "", L)  # user id: everything before the colon
  items <- scan(text = sub(".*:", "", L), sep = ",", quiet = TRUE)  # friend ids
  matrix(c(rep(n, length(items)), items), ncol = 2)
}))
#---------
[,1] [,2]
[1,] "1" "12"
[2,] "1" "33"
[3,] "3" "4"
[4,] "3" "6"
[5,] "3" "10"
[6,] "3" "15"
[7,] "3" "16"
If the path forward isn't obvious, see ?as.numeric and ?as.data.frame.
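A base R alternative that never builds the wide, NA-padded table is to split each line on the colon and expand the ids directly. The following is a minimal sketch, assuming the file has no header line and all ids are integers; the names `lines`, `parts`, `friends` and `edges` are illustrative, and the inline vector stands in for readLines("test.txt").

```r
# Sample lines; in practice these would come from readLines("test.txt")
lines <- c("1: 12, 33", "2:", "3: 4, 6, 10, 15, 16")

# Split each line into the user id and the (possibly missing) friend list
parts <- strsplit(lines, ":", fixed = TRUE)
ids <- as.integer(vapply(parts, `[`, character(1), 1))

# Parse the friend ids; users without friends yield a zero-length vector
friends <- lapply(parts, function(p) {
  if (length(p) < 2) return(integer(0))
  as.integer(strsplit(trimws(p[2]), ",\\s*")[[1]])
})

# Repeat each user id once per friend to get the long two-column form
edges <- data.frame(UserId = rep(ids, lengths(friends)),
                    FriendId = unlist(friends))
```

Users with no friends simply contribute no rows, so no NA padding is ever allocated.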

Related

How to change the class of a column in a list of a list from character to numeric in r?

The code for producing a sample dataset and converting from character to numeric is below:
ff = data.frame(a = c('1','2','3'),b = 1:3, c = 5:7)
#data.frame is a type of list.
fff = list(ff,ff,ff,ff)
k = fff %>% map(~map(.x,function(x){x['a'] %<>% as.numeric
return(x)}))
However, the result looks like this:
Three lists appear in each of the nested lists (3*3 = 9), which is very strange.
I think the result should have 3 lists in a nested list (3*1 = 3).
What I want is to convert every a in each data frame to numeric.
> k
[[1]]
[[1]]$a
a
"1" "2" "3" NA
[[1]]$b
a
1 2 3 NA
[[1]]$c
a
5 6 7 NA
[[2]]
[[2]]$a
a
"1" "2" "3" NA
[[2]]$b
a
1 2 3 NA
[[2]]$c
a
5 6 7 NA
[[3]]
[[3]]$a
a
"1" "2" "3" NA
[[3]]$b
a
1 2 3 NA
[[3]]$c
a
5 6 7 NA
[[4]]
[[4]]$a
a
"1" "2" "3" NA
[[4]]$b
a
1 2 3 NA
[[4]]$c
a
5 6 7 NA
I cannot understand why I cannot convert a into numeric...
Like this, with mutate:
fff %>%
map(~ mutate(.x, a = as.numeric(a)))
Or, more base R style:
fff %>%
map(\(x) {x$a <- as.numeric(x$a); x})
You should use map only once, because you don't have a nested list. With the first map, you access each data frame, and then you can convert the column to numeric. With a second map, you would be accessing the columns of each data frame (which you don't want).
With two maps, it's also preferable to use \ or function rather than ~, because it becomes confusing to use .x and x for different objects. In your question, .x is the data frame, while x is each of its columns.
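The same single-pass idea can be checked without purrr. This is a minimal sketch, assuming the sample data from the question and using base lapply in place of map; `k` and `classes` are illustrative names.

```r
# Sample data from the question: a list of four identical data frames
ff <- data.frame(a = c("1", "2", "3"), b = 1:3, c = 5:7,
                 stringsAsFactors = FALSE)
fff <- list(ff, ff, ff, ff)

# One pass over the list: each element is a whole data frame
k <- lapply(fff, function(df) {
  df$a <- as.numeric(df$a)
  df
})

# Inspect the column classes of the first converted data frame
classes <- vapply(k[[1]], function(col) class(col)[1], character(1))
```

Each element of `k` is still a three-column data frame, and only column a has changed type.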

Pivot table of concatenated string in r

I have the following dataset:
mydata<- data.frame(Factors= c("a,b" , "c,d" , "a,c"), Valu = c ("2,3" , "7,8" , "9,1"))
Factors Valu
1 a,b 2,3
2 c,d 7,8
3 a,c 9,1
and I wish to convert it to the following, which lists all the values that occurred with each factor:
My ideal output
a b c d
2 2 7 7
3 3 8 8
9   9
1   1
I need a pivot table. However, I first need to prepare the data and then use melt and dcast to get my desired output. One of my failed attempts at preparing the data is:
mydata2 <- cSplit(mydata, c("Factors","Valu") , ",", "long")
But they lose their connections.
Here is a one-liner with cSplit:
library(splitstackshape)
with(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"), split(Valu, Factors))
#$a
#[1] 2 3 9 1
#$b
#[1] 2 3
#$c
#[1] 7 8 9 1
#$d
#[1] 7 8
If we need a data.table/data.frame, use dcast to convert the 'long' format to 'wide'.
dcast(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"),
rowid(Factors)~Factors, value.var="Valu")[, Factors := NULL][]
# a b c d
#1: 2 2 7 7
#2: 3 3 8 8
#3: 9 NA 9 NA
#4: 1 NA 1 NA
NOTE: splitstackshape loads data.table. Here, we used data.table_1.10.0. The dcast from data.table is also very fast.
Using a couple of *applys, strsplit and grep
## convert columns to characters so you can use strsplit
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
## get all the unique factor values by splitting them
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time)
l <- sapply(f, function(x) mydata[grep(x, mydata$Factors), "Valu"])
This gives a list, where each element is named by the 'Factor' value, and it contains all the 'Valu' values associated with it
l
# $a
# [1] "2,3" "9,1"
#
# $b
# [1] "2,3"
#
# $c
# [1] "7,8" "9,1"
#
# $d
# [1] "7,8"
Another lapply on this list will split the 'Valu's
result <- lapply(l, function(x) unlist(strsplit(x, split = ",")))
result
# $a
# [1] "2" "3" "9" "1"
#
# $b
# [1] "2" "3"
#
# $c
# [1] "7" "8" "9" "1"
#
# $d
# [1] "7" "8"
Edit
To get the result in a data.frame, you can make each list element the same length (by filling with NA), then call data.frame on the result
## the number of rows required for each column
maxLength <- max(sapply(result, length))
## append 'NA's to list elements with fewer than maxLength elements
result <- data.frame(sapply(result, function(x) c(x, rep(NA, maxLength - length(x))) ))
result
# a b c d
# 1 2 2 7 7
# 2 3 3 8 8
# 3 9 <NA> 9 <NA>
# 4 1 <NA> 1 <NA>
Edit
In response to the comment: if you have 'similar' strings, a plain grep also matches partial values (for example "a" inside "ao"). You can make your grep regex explicit by wrapping each value in word boundaries, \\b (see any regex cheatsheet for explanations)
mydata<- data.frame(Factors= c("a,b" , "c,d" , "a,c", "bo,ao"), Valu = c ("2,3" , "7,8" , "9,1", "x,y"))
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time); \\b keeps "a" from matching inside "ao"
l <- sapply(f, function(x) mydata[grep(paste0("\\b", x, "\\b"), mydata$Factors, perl = TRUE), "Valu"])
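As a quick sanity check on partial matches, the sketch below compares a grouped pattern, a word-boundary pattern and an exact comparison after splitting, all on the sample factors. The names `factors`, `grouped`, `bounded` and `exact` are illustrative only.

```r
factors <- c("a,b", "c,d", "a,c", "bo,ao")

# Parentheses only group; "a" still matches inside "ao"
grouped <- grep("(a)", factors)

# Word boundaries require "a" to stand alone between the commas
bounded <- grep("\\ba\\b", factors, perl = TRUE)

# Regex-free alternative: split first, then test exact membership
exact <- which(vapply(strsplit(factors, ",", fixed = TRUE),
                      function(p) "a" %in% p, logical(1)))
```

The split-and-compare variant avoids regular expressions altogether, which may be safer if the factor values can contain regex metacharacters.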
Another base R attempt:
# character conversion first
mydata[] <- lapply(mydata, as.character)
long <- do.call(rbind,
do.call(Map, c(expand.grid, lapply(mydata, strsplit, ","), stringsAsFactors=FALSE))
)
split(long$Valu, long$Factors)
#$a
#[1] "2" "3" "9" "1"
#
#$b
#[1] "2" "3"
#
#$c
#[1] "7" "8" "9" "1"
#
#$d
#[1] "7" "8"
I misunderstood in my comment above; if you want every Factor to match every Valu, you need to separate the columns independently to get the combinations. If you add indices to spread by, it's not too bad:
library(tidyverse)
mydata %>%
separate_rows(Factors) %>% separate_rows(Valu, convert = TRUE) %>%
# add indices to give row order when spreading
group_by(Factors) %>% mutate(i = row_number()) %>%
spread(Factors, Valu) %>%
select(-i) # clean up extra column
## # A tibble: 4 × 4
## a b c d
## * <int> <int> <int> <int>
## 1 2 2 7 7
## 2 3 3 8 8
## 3 9 NA 9 NA
## 4 1 NA 1 NA

Convert Strings into data.frame using R

I have 1000+ rows of strings which I extracted from a column of an Excel worksheet. Here's what the data looks like (3 rows):
Chicken(31%);Duck(16%);Wild duck(14%);Turkey(10%);Pigeon(4%);Goose(4%);Wild bird(4%);Tree sparrow(2%)
Tree sparrow(2%)
Chicken(1%)
I need to put the data into a table (for this example: 8 columns x 3 rows). Can anyone help?
x <- c("Chicken(31%);Duck(16%);Wild duck(14%);Turkey(10%);Pigeon(4%);Goose(4%);Wild bird(4%);Tree sparrow(2%)",
"Tree sparrow(2%)", "Chicken(1%)")
There is most likely a more concise way, but you can try something like this:
library(stringi)
library(data.table)
# Drop empty lines if any
txt <- Filter(function(x) !stri_isempty(stri_trim(x)), x)
# Extract matches
matches <- stri_match_all_regex(txt, "([\\w\\s]+)\\(([0-9]+)%\\);?")
matches[[1]]
## [,1] [,2] [,3]
## [1,] "Chicken(31%);" "Chicken" "31"
## [2,] "Duck(16%);" "Duck" "16"
## [3,] "Wild duck(14%);" "Wild duck" "14"
## [4,] "Turkey(10%);" "Turkey" "10"
## [5,] "Pigeon(4%);" "Pigeon" "4"
## [6,] "Goose(4%);" "Goose" "4"
## [7,] "Wild bird(4%);" "Wild bird" "4"
## [8,] "Tree sparrow(2%)" "Tree sparrow" "2"
# Rearrange
rows <- lapply(
  matches,
  function(x) setNames(as.list(as.numeric(x[, 3])), x[, 2]))
rbindlist(rows, fill=TRUE)
## Chicken Duck Wild duck Turkey Pigeon Goose Wild bird Tree sparrow
## 1: 31 16 14 10 4 4 4 2
## 2: NA NA NA NA NA NA NA 2
## 3: 1 NA NA NA NA NA NA NA
Regex explanation
([\\w\\s]+) # At least one word character or whitespace *, 1st group
\\( # Left parenthesis
([0-9]+) # At least one digit (0 included, so that 10% matches). You can replace + with {1,2}, 2nd group
% # Percent sign
\\) # Right parenthesis
;? # Optional semicolon
* Could be \\w[\\w\\s]+
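For comparison, the same name/percentage extraction can be sketched in base R without stringi or data.table. This is only an illustration, assuming every entry has the form Name(NN%) and entries are separated by semicolons; `parse_row`, `rows`, `all_names` and `tab` are hypothetical names.

```r
x <- c("Chicken(31%);Duck(16%);Wild duck(14%);Turkey(10%);Pigeon(4%);Goose(4%);Wild bird(4%);Tree sparrow(2%)",
       "Tree sparrow(2%)", "Chicken(1%)")

# Turn one string into a named numeric vector: name -> percentage
parse_row <- function(s) {
  parts <- strsplit(s, ";", fixed = TRUE)[[1]]
  nm <- sub("\\(.*$", "", parts)                                 # text before "("
  pct <- as.numeric(sub("^.*\\(([0-9]+)%\\)$", "\\1", parts))    # digits before "%"
  setNames(pct, nm)
}

rows <- lapply(x, parse_row)
all_names <- unique(unlist(lapply(rows, names)))

# Align every row to the full set of names; missing species become NA
tab <- t(vapply(rows, function(r) unname(r[all_names]),
                numeric(length(all_names))))
colnames(tab) <- all_names
```

The result is a numeric matrix with one row per input string and one column per species; as.data.frame(tab) would turn it into a data frame.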
Here's one possible solution (note that it tabulates which species occur in each row rather than keeping the percentages):
library(qdapTools)
mtabulate(strsplit(gsub("\\(\\d+%\\)", "", x), ";"))
## Chicken Duck Goose Pigeon Tree sparrow Turkey Wild bird Wild duck
## 1 1 1 1 1 1 1 1 1
## 2 0 0 0 0 1 0 0 0
## 3 1 0 0 0 0 0 0 0

Combining lists as rows

I am new to R and I was wondering if there is a way to create a data frame from lists. Here is an example.
n = c(1,4,5)
b = c(7,19,20)
v = c(3,8,9,4,5)
x = list(n,b,v)
If I use the command x I get columns. Is there a way I can combine them as rows, given that they share similar headers (like employee, count, id, row number, pages, page visits), and create a data frame like this?
employee | count | id |row number| pages| page visits
1 4 5
7 19 20
3 8 9 4 5
You can try stri_list2matrix from the "stringi" package:
library(stringi)
stri_list2matrix(x, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] "1" "4" "5" NA NA
# [2,] "7" "19" "20" NA NA
# [3,] "3" "8" "9" "4" "5"
However, your sample data only has 5 columns and you are expecting to create a data.frame with 6 columns.
You can also try listCol_w from my "splitstackshape" package:
library(splitstackshape)
listCol_w(data.table(id = seq_along(x), x), "x", fill = NA_real_)
id x_fl_1 x_fl_2 x_fl_3 x_fl_4 x_fl_5
1: 1 1 4 5 NA NA
2: 2 7 19 20 NA NA
3: 3 3 8 9 4 5
The NA_real_ is so that the results can be retained as numeric. (NA_integer_ is also appropriate here.)
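If neither package is available, the same padding can be sketched in base R. This is a minimal illustration, assuming all list elements are numeric; `n_max`, `m` and `df` are illustrative names.

```r
x <- list(c(1, 4, 5), c(7, 19, 20), c(3, 8, 9, 4, 5))

# Length of the longest element decides the number of columns
n_max <- max(lengths(x))

# Pad each vector with NA_real_ so every row has n_max entries, then bind as rows
m <- t(vapply(x, function(v) c(v, rep(NA_real_, n_max - length(v))),
              numeric(n_max)))

df <- as.data.frame(m)
```

Padding with NA_real_ keeps the matrix numeric, mirroring the fill argument used above.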
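Another option uses rbind.fill from plyr, with the column names taken from the question:
library(plyr)
cnams <- c("employee", "count", "id", "row_number", "pages", "page_visits")
Reduce(function(z, y) rbind.fill(z,
         setNames(data.frame(as.list(y)), cnams[seq_along(y)])),
       x,
       init = setNames(data.frame(as.list(cnams))[0, ], cnams))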
employee count id row_number pages page_visits
1 1 4 5 <NA> <NA> <NA>
2 7 19 20 <NA> <NA> <NA>
3 3 8 9 4 5 <NA>

numerical values of the column of a matrix getting modified when converting into data.frame

Running on R 2.13, I want to have a data.frame with several columns, the first one being of numeric type, the others of character type. When I create my object, the values of the first column are transformed in a way that I don't expect or understand. Please see the code below.
tmp <- cbind(1:10,rep("aa",10))
tmp
[,1] [,2]
[1,] "1" "aa"
[2,] "2" "aa"
[3,] "3" "aa"
[4,] "4" "aa"
[5,] "5" "aa"
[6,] "6" "aa"
[7,] "7" "aa"
[8,] "8" "aa"
[9,] "9" "aa"
[10,] "10" "aa"
tmp <- data.frame(tmp)
tmp
X1 X2
1 1 aa
2 2 aa
3 3 aa
4 4 aa
5 5 aa
6 6 aa
7 7 aa
8 8 aa
9 9 aa
10 10 aa
tmp[,1] <- as.numeric(tmp[,1])
tmp
X1 X2
1 1 aa
2 3 aa
3 4 aa
4 5 aa
5 6 aa
6 7 aa
7 8 aa
8 9 aa
9 10 aa
10 2 aa
For some reason, the values of the first column are getting changed. I must be doing something obviously wrong here; can someone point me to a workaround?
> tmp <- data.frame(cbind(1:10,rep("aa",10)))
> str(tmp)
'data.frame': 10 obs. of 2 variables:
$ X1: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
$ X2: Factor w/ 1 level "aa": 1 1 1 1 1 1 1 1 1 1
As you can see above, tmp$X1 got converted into a factor, which is what's causing the behaviour you're seeing.
Try:
tmp[,1] <- as.numeric(as.character(tmp[,1]))
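The difference between the two conversions can be seen on a small factor. This is a minimal sketch with illustrative names (`f`, `codes`, `values`).

```r
# Levels are sorted as strings: "1", "10", "2"
f <- factor(c("1", "2", "10"))

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(f)

# Going through character first recovers the printed values
values <- as.numeric(as.character(f))
```

Here `codes` holds the positions of each value among the sorted levels, while `values` holds the numbers that were actually printed.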
@aix's answer is a correct diagnosis. However, probably what you want to do is to create a data frame directly:
data.frame(1:10,rep("aa",10))
Rather than cbinding first (which makes a matrix) and then converting to a data frame.
You might want to give your variables sensible names rather than the weird ones they will end up with via the data.frame command above (X1.10 and rep..aa...10.):
data.frame(var1=1:10,var2=rep("aa",10))
Since data.frame replicates its arguments, you can shorten this even a bit more:
data.frame(var1=1:10,var2="aa")
And if you really want a character vector rather than a factor for the second column, you can use stringsAsFactors=FALSE or wrap var2 in I() (i.e. var2=I("aa")).
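Both options can be checked quickly. The sketch below uses illustrative names (`df1`, `df2`); note that in R 4.0.0 and later stringsAsFactors already defaults to FALSE, so the explicit argument mainly matters on older versions such as the one in the question.

```r
# Option 1: switch off factor conversion for the whole data frame
df1 <- data.frame(var1 = 1:10, var2 = "aa", stringsAsFactors = FALSE)

# Option 2: protect a single column with I(), which marks it "as is"
df2 <- data.frame(var1 = 1:10, var2 = I(rep("aa", 10)))

# The first column stays numeric, the second stays character in both cases
str(df1)
str(df2)
```

With I() the column additionally carries the class "AsIs", which is harmless for most purposes.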

Resources