Load csv file with different number of strings - r

How would I load data from a csv file into R if the file contains a different number of strings on every line? I need to combine it into one variable (e.g. a list of lists?). The data in the file looks like this (I don't know the maximum number of elements in one row):
Peter; Paul; Mary
Jeff;
Peter; Jeff
Julia; Vanessa; Paul

Use fill=TRUE:
read.table(text='
Peter; Paul; Mary
Jeff;
Peter; Jeff
Julia; Vanessa; Paul',sep=';',fill=TRUE)
V1 V2 V3
1 Peter Paul Mary
2 Jeff
3 Peter Jeff
4 Julia Vanessa Paul
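One caveat: read.table() works out the number of columns from the first five lines of input, so a longer row further down may not be read as intended. A minimal sketch of one way around this, assuming the same sample data, is to count the fields on every line with count.fields() and pass explicit col.names. The variable names and the V1, V2, ... column names are only illustrative.

```r
# Sample data from the question, kept inline so the sketch is self-contained
txt <- "Peter; Paul; Mary
Jeff;
Peter; Jeff
Julia; Vanessa; Paul"

# Count the fields on every line, not just the first five
con <- textConnection(txt)
n_cols <- max(count.fields(con, sep = ";"))
close(con)

# Explicit col.names fix the column count; strip.white removes the padding spaces
df <- read.table(text = txt, sep = ";", fill = TRUE, strip.white = TRUE,
                 col.names = paste0("V", seq_len(n_cols)))
```

With a real file, the same two calls would take the file path instead of a text connection.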

csv_lines <- readLines("tmp3.csv")
getLine <- function(x) {
  fields <- scan(text=x,sep=";",what="character",quiet=TRUE)
  fields <- gsub("(^ +| +$)","",fields) ## strip whitespace
  fields[nchar(fields)>0] ## drop empties (after stripping, so blank fields go too)
}
lapply(csv_lines,getLine)
## [[1]]
## [1] "Peter" "Paul" "Mary"
##
## [[2]]
## [1] "Jeff"
##
## [[3]]
## [1] "Peter" "Jeff"
##
## [[4]]
## [1] "Julia" "Vanessa" "Paul"
This is technically a list of vectors rather than a list of lists but it might be what you want ...

Related

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated !!!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
  docname from to          pre       keyword              post       pattern
1       1   29 29  is the word stackoverflow However there are stackoverflow
2       2   24 24 the very end stackoverflow                   stackoverflow
Now read the help for kwic (use ?kwic in console) to see what kind of patterns you can use. With tokens you can specify which data cleaning you want to use before using kwic. In my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of 3 tokens. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first).
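Both answers above measure the window in tokens. If the window really has to be measured in characters, as the title suggests, a rough base R sketch (not a quanteda feature) could look like the following. The sample speeches, the variable names and the half_width value are assumptions for illustration, and only the first match in each speech is used.

```r
speech <- c("Some words come first, then the word stackoverflow appears in the middle of this speech.",
            "A shorter speech that ends with stackoverflow.",
            "This speech never mentions the keyword.")
keyword <- "stackoverflow"
half_width <- 20  # characters kept on each side; 50 would give roughly a 100-character window

# Position and length of the first match in each speech (-1 when there is no match)
m <- regexpr(keyword, speech, fixed = TRUE)
pos <- as.integer(m)
match_end <- pos + attr(m, "match.length") - 1

# Cut out the window, clamping the start at the first character
window <- substr(speech, pmax(pos - half_width, 1), match_end + half_width)
window[pos < 0] <- NA_character_  # no keyword, no window
```

A character window can cut words in half at either edge, which is one reason the token-based approaches are usually easier to feed into a sentiment dictionary.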

How do I subset a list with mixed data type and data structure?

I have a list which includes a mix of data types (character) and data structures (dataframe).
I want to keep only the dataframes and remove the rest.
> head(list)
[[1]]
[1] "/Users/Jane/R/12498798.txt error"
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
[[3]]
[1] "/Users/Jane/R/12498780.txt error"
I expect the final list to contain only dataframes:
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
Based on the example, it is possible that the other elements of the OP's list are character vectors, in which case you can remove any element that ends with the substring 'error':
list[!sapply(list, function(x) any(grepl("error$", x)))]
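If the goal is simply to keep whatever is a data frame, regardless of what the other elements contain, a minimal sketch is to filter on is.data.frame(). The results list below is a hypothetical stand-in for the list in the question.

```r
# Hypothetical stand-in for the list in the question
results <- list(
  "/Users/Jane/R/12498798.txt error",
  data.frame(match = c("Japan arrests man for taking gun",
                       "Extradition bill turns ugly"),
             file = "/Users/Jane/R/12498770.txt"),
  "/Users/Jane/R/12498780.txt error"
)

# Keep only the elements for which is.data.frame() is TRUE
only_dfs <- Filter(is.data.frame, results)

# Equivalent with logical indexing
only_dfs2 <- results[vapply(results, is.data.frame, logical(1))]
```

Note that the surviving data frame becomes element [[1]] of the new list, because the list is re-indexed after filtering.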

How to convert a list with same type of field to a data.frame in R

I have a list in which every element has fields with the same names (only the values differ), and I need to convert it into a data.frame whose column names are the field names. Following is my list,
Data input (data input in json format.json)
library(rjson)
data <- fromJSON(file = "data input in json format.json")
head(data,3)
[[1]]
[[1]]$floors
[1] 5
[[1]]$elevation
[1] 15
[[1]]$bmi
[1] 23.7483
[[2]]
[[2]]$floors
[1] 4
[[2]]$elevation
[1] 12
[[2]]$bmi
[1] 23.764
[[3]]
[[3]]$floors
[1] 3
[[3]]$elevation
[1] 9
[[3]]$bmi
[1] 23.7797
And my expected data.frame is,
floors elevation bmi
5 15 23.7483
4 12 23.7640
3 9 23.7797
Can you help me figure this out?
Thanks in advance.
You can use jsonlite.
library(jsonlite)
Then use fromJSON() and specify the path to your file (or alternatively a URL or the raw text) in the argument txt:
fromJSON(txt = 'path/to/json/file.json')
The result is:
floors elevation bmi
1 5 15 23.7483
2 4 12 23.7640
3 3 9 23.7797
If you prefer rjson, you could first read it as previously:
data <- rjson::fromJSON(file = 'path/to/json/file.json')
Then use do.call() and rbind.data.frame() to convert the list to a dataframe:
do.call("rbind.data.frame", data)
As an alternative to do.call(), use data.table's rbindlist(), which is faster:
data.table::rbindlist(data)
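To try the row-binding idea without the JSON file, here is a small base R sketch. The data list is a hand-written stand-in for the parsed JSON, and the sketch uses a slight variant of the call above: each record is converted to a one-row data frame before the rows are stacked.

```r
# Hypothetical stand-in for the parsed JSON list from the question
data <- list(
  list(floors = 5, elevation = 15, bmi = 23.7483),
  list(floors = 4, elevation = 12, bmi = 23.7640),
  list(floors = 3, elevation = 9,  bmi = 23.7797)
)

# Turn each record into a one-row data frame, then stack the rows
df <- do.call(rbind, lapply(data, as.data.frame))
```

This relies on every record having the same field names, as described in the question.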

Convert character column to a list within the data frame

When I read the csv file into df, SoftwareOwner is a character column
> df
Software SoftwareOwner
<chr> <chr>
1 I-DEAS Siemens
2 TeamViewer Autodesk, TeamViewer, Siemens
3 Inventor PTC, Google, SpaceClaim, Bricys
4 AutoCAD Autodesk
I want to make SoftwareOwner a list within this data frame so I tried the simple solution
> df$SoftwareOwner <- as.list(df$SoftwareOwner)
But all this did was make each entry in the column a list with one entry
> df$SoftwareOwner[2]
[[1]]
[1] "Autodesk, TeamViewer, Siemens"
I've tried adding parameters like sep = "," and all.names = TRUE to as.list but neither worked. Is there any way to access just Autodesk or TeamViewer or Siemens when calling something like what I have just above?
Might I recommend making Siemens, Autodesk, Teamviewer, etc. their own columns and coding a 1 or 0 to indicate ownership? In my experience this is a far more flexible approach.
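A rough sketch of that indicator-column idea in base R, assuming the owners are always separated by commas (the variable names are illustrative):

```r
owners <- c("Siemens",
            "Autodesk, TeamViewer, Siemens",
            "PTC, Google, SpaceClaim, Bricys",
            "Autodesk")

# Split each entry on commas and trim the surrounding spaces
split_owners <- lapply(strsplit(owners, ","), trimws)

# One column per distinct owner, 1 if the row lists that owner, else 0
all_owners <- sort(unique(unlist(split_owners)))
indicators <- t(vapply(split_owners,
                       function(x) as.integer(all_owners %in% x),
                       integer(length(all_owners))))
colnames(indicators) <- all_owners
```

The resulting matrix has one row per software package and can be attached to the original data frame with cbind().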
A possible solution:
# recreate your data.frame
df <- read.csv(text=
"Software;SoftwareOwner
I-DEAS;Siemens
TeamViewer;Autodesk, TeamViewer, Siemens
Inventor;PTC, Google, SpaceClaim, Bricys
AutoCAD;Autodesk",sep=";")
df$SoftwareOwner <- lapply(strsplit(as.character(df$SoftwareOwner),split=','),trimws)
# > df$SoftwareOwner
# [[1]]
# [1] "Siemens"
#
# [[2]]
# [1] "Autodesk" "TeamViewer" "Siemens"
#
# [[3]]
# [1] "PTC" "Google" "SpaceClaim" "Bricys"
#
# [[4]]
# [1] "Autodesk"
# > df$SoftwareOwner[[2]][3]
# [1] "Siemens"
# > df$SoftwareOwner[[3]][2]
# [1] "Google"

Weird behavior of R brackets

So I've been trying to get a subset of a character vector for the last hour or so. In my (floundering) attempt to get this working I ran into an interesting characteristic of R. I have data (after JSON parsing) in the form of
[[1]]
[[1]]$business_id
[1] "rncjoVoEFUJGCUoC1JgnUA"
[[1]]$full_address
[1] "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345"
[[1]]$open
[1] TRUE
[[1]]$categories
[1] "Accountants" "Professional Services" "Tax Services"
[4] "Financial Services"
[[1]]$city
[1] "Peoria"
[[1]]$review_count
[1] 3
[[1]]$name
[1] "Peoria Income Tax Service"
[[1]]$neighborhoods
list()
[[1]]$longitude
[1] -112.2416
[[1]]$state
[1] "AZ"
[[1]]$stars
[1] 5
[[1]]$latitude
[1] 33.58187
[[1]]$type
[1] "business"
Here's the code I'm using
#!/usr/bin/Rscript
require(graphics)
require(RJSONIO)
parsed_data <- lapply(readLines("yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json"), fromJSON)
#parsed_data[,c("categories")]
print(parsed_data[1])
As I was trying to drop everything but the categories column I ran into this interesting behaviour
print(parsed_data[1])
print(parsed_data[1][1])
print(parsed_data[1][1][1][1][1][1])
All produce the same output (the one posted above). Why is that?
This is the difference between [ and [[. It is hard to search for these online, but ?'[' will bring up the help.
When indexing a list with [, a list is returned:
list(a=1:10, b=11:20)[1]
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
This is a list of one element, so repeating the operation again results in the same value:
list(a=1:10, b=11:20)[1][1]
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
[[ returns the element, not a list containing the element. It also only accepts a single index (whereas [ accepts a vector):
list(a=1:10, b=11:20)[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
And this operation is not idempotent on lists:
list(a=1:10, b=11:20)[[1]][[1]]
## [1] 1
Your JSON data is currently stored in a list, rather than a vector, so the indexing is different.
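The same points can be checked programmatically. This is only a small sketch on the toy list used above:

```r
x <- list(a = 1:10, b = 11:20)

is.list(x[1])               # TRUE: single brackets return a sub-list
identical(x[1], x[1][1])    # TRUE: subsetting the one-element list again changes nothing
is.list(x[[1]])             # FALSE: double brackets return the element itself
identical(x[[1]], 1:10)     # TRUE
identical(x[["a"]], x$a)    # TRUE: name-based access gives the same element
```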
As Matthew has pointed out, there is a difference between using [] to access an element and using [[]]. For a discussion of this, I refer you to this Stack Overflow thread:
In R, what is the difference between the [] and [[]] notations for accessing the elements of a list?
Looking at the printout, your data is stored as a nested list:
parsed_data[[1]]
This will give you a list containing each of the columns. To access the categories column you can use any of the following:
parsed_data[[1]][["categories"]]
parsed_data[[1]][[4]]
parsed_data[[1]]$categories
This will give you a vector of names, as you'd expect:
## [1] "Accountants" "Professional Services" "Tax Services"
## [4] "Financial Services"
Note that when accessing by index (either named or numeric) you still have to use the double bracket notation: [[]]. If you use [] instead, it will give you a list instead of a vector:
parsed_data[[1]]["categories"]
## $categories
## [1] "Accountants" "Professional Services" "Tax Services"
## [4] "Financial Services"
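To drop everything but the categories for every business, not just the first, one option is to apply `[[` across the whole list. A minimal sketch, using a cut-down, partly made-up stand-in for parsed_data:

```r
# Hypothetical stand-in for parsed_data with only a few fields per business
parsed_data <- list(
  list(name = "Peoria Income Tax Service",
       categories = c("Accountants", "Tax Services")),
  list(name = "Another Business",   # made-up record for illustration
       categories = c("Restaurants"))
)

# Apply `[[` to every record to pull out a single field
categories <- lapply(parsed_data, `[[`, "categories")
```

The result is a list with one character vector per business, which keeps businesses with different numbers of categories intact.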
