Loading text file into R to analyze chat log

So, I have been trying to load a text file (each line is a chat log entry) into R to turn it into a data frame and further tidy the data.
I am using readLines() so I can have each log as a single line. Because readLines() reads them in as one long character vector, and I need to parse the logs, I then convert each element to a string, as per below:
rawchat <- readLines("disc-W-App-avec-loy.txt")
rawchat <- c(lapply(rawchat, toString))
My problem comes when I want to turn this list into a data frame:
rawchat <- as.data.frame(rawchat)
It turns the list into a data frame of 1 observation of 42,000 variables. The intention was to turn it into 42,000 observations of one variable.
Any help please?
By the way, I am pretty new to tidying raw data in R.
So, I encountered another roadblock:
I loaded a text file as a data frame, as per below.
rawchat <- readLines("disc-W-App-avec-loy.txt")
rawchat <- as.data.frame(rawchat, stringsAsFactors=FALSE)
names(rawchat) <- "chat"
I am currently trying to identify any of the 42,000 rows that starts with the number 16. I can't seem to correctly apply the startsWith() function, dplyr's starts_with(), or even grepl() with regular expressions.
Could it be the format of the observations in the data frame (chr)?

The problem is your line rawchat <- c(lapply(rawchat, toString)), which turns the character vector into a list of 42,000 elements. Just use:
rawchat <- readLines("disc-W-App-avec-loy.txt")
rawchat <- as.data.frame(rawchat, stringsAsFactors=FALSE)
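As for finding the rows that start with 16 (the second question), grepl() or startsWith() on the character column should work; a minimal sketch, assuming the column is named chat as in the question:
names(rawchat) <- "chat"
# logical index: TRUE where the line begins with "16"
starts16 <- grepl("^16", rawchat$chat)   # or: startsWith(rawchat$chat, "16")
rawchat[starts16, , drop = FALSE]        # keep only the matching rows
Note that dplyr's starts_with() selects columns by name rather than filtering rows, which is likely why it did not work here.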

Related

How can I create a simple dataframe from nested, JSON format API content

Using a JSON format file pulled from the SeatGeek API, I'd like to convert the data into a data frame. I've managed to create a frame with all variables + data using the function below:
library(httr)
library(jsonlite)
vpg <- GET("https://api.seatgeek.com/2/venues?country=US&per_page=5000&page=1&client_id=NTM2MzE3fDE1NzM4NTExMTAuNzU&client_secret=77264dfa5a0bc99095279fa7b01c223ff994437433c214c8b9a08e6de10fddd6")
vpgc <- content(vpg)
vpgcv <- (vpgc$venues)
json_file <- sapply(vpgcv, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
  as.data.frame(t(x))
})
From this point, I can create a data frame using:
venues.dataframe <- as.data.frame(t(json_file), flatten = TRUE)
But the resulting data frame, while it has the correct 23 variables and 5000 rows, holds a list in each entry rather than just a value. How can I pull the value out of each list?
I've also attempted to pull the values out using data tables in the following code:
library(data.table)
data.table::rbindlist(json_file, fill= TRUE)
But the output data frame flows almost diagonally, placing 1 stored variable + 22 NULL values per row. While all the data exists here, rows 1-23 (and 24-46, and so on) should each collapse into a single row.
Of these two dead ends, which is the easiest/cleanest solution to produce my desired data frame output of [5000 observations, in simple value form of 23 variables]?
Your URL points directly to the JSON data, so there is no need for the GET function. The jsonlite library can handle the download directly.
library(jsonlite)
output <- fromJSON("https://api.seatgeek.com/2/venues?country=US&per_page=5000&page=1&client_id=NTM2MzE3fDE1NzM4NTExMTAuNzU&client_secret=77264dfa5a0bc99095279fa7b01c223ff994437433c214c8b9a08e6de10fddd6")
df <- output$venues
flatdf <- flatten(df)
# remove first column of empty lists
flatdf <- flatdf[, -1]
The variable output is a list of data frames parsed from the JSON object; use "$" to retrieve the part of interest. df still has some embedded data frames; to flatten them, use the flatten() function from the jsonlite package.
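If any list-columns survive the flattening (the question's first dead end), they can usually be coerced column by column; a rough sketch, not specific to this API:
# replace NULLs with NA, then simplify each remaining list-column
flatdf[] <- lapply(flatdf, function(col) {
  if (!is.list(col)) return(col)
  sapply(col, function(v) if (is.null(v)) NA else v)
})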

Altering dataframes stored within a list

I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and applying it over all the data frames.
cleanDF <- function(mydf) {
  if (!all(c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name') %in%
           names(mydf))) stop("Check data frame names")
  condition <- mydf[, 'AlterPair_B'] >= 4
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are there. Then it builds the condition that 'AlterPair_B' is 4 or more. Lastly, it subsets the two target columns by that condition. I used a list called 'big_list' that represents all of the data frames.
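If you would rather end up with a list of cleaned data frames, one per respondent, as the question asks, just drop the do.call('rbind', ...) step:
cleaned_list <- lapply(big_list, cleanDF)  # one cleaned data frame per file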
You haven't provided a reproducible example, so it's hard to solve your problem. However, I don't want your question to remain unanswered. It is true that lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do it with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do your calculations, and rbind the results into the result object.
path <-"C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filename in list_of_csv_files) {
  input <- read.csv(paste0(path, filename), header=TRUE, stringsAsFactors=FALSE)
  # Do your calculations
  input_with_calculations <- input
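  # e.g., for the AlterPair files from the earlier question, the cleaning
  # step could be (column names assumed from that question):
  # input_with_calculations <- input[input$AlterPair_B >= 4,
  #                                  c("Alter.1.Name", "Alter.2.Name")]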
  result <- rbind(result, input_with_calculations)
}
result

For Loop Over List of Data Frames and Create New Data Frames from Every Iteration Using Variable Name

I cannot for the life of me figure out the simple error in my for loop. I want to perform the same analyses over multiple data frames and, on each iteration, output the new data frame under a name built from the variable used plus an extra string to identify it.
Here is my code:
john and jane are two data frames, among many, that I am hoping to loop over and compare to bcm to find duplicated rows.
x <- list(john,jane)
for (i in x) {
  test <- rbind(bcm,i)
  test$dups <- duplicated(test$Full.Name,fromLast=T)
  test$dups2 <- duplicated(test$Full.Name)
  test <- test[which(test$dups==T | test$dups2==T),]
  newname <- paste("dupl",i,sep=".")
  assign(newname, test)
}
Thus far, I can either get the naming to work correctly without including the data from x, or get the loop to complete correctly but with the new data frames named incorrectly.
Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.
I understand that lapply() might be better to use and am very open to that form of solution. I could not figure out how to use it to solve my problem, so I turned to the more familiar for loop.
EDIT:
Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.
If I understand correctly, based on your intended result, using plyr's match_df could be an option.
library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)
dupl.john and dupl.jane will both be data frames, each holding the rows that appear both in the respective input data frame and in bcm. Is this what you are trying to achieve?
EDITED after the first comment
library(plyr)
l <- list(john, jane)
res <- lapply(l, function(x) match_df(x, bcm, on = "Full.Name"))
dupl.john <- res[[1]]
dupl.jane <- res[[2]]
Now, res will have a list of the data frames with the matches, based on the column "Full.Name".
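For completeness, the original for loop can also be made to work: the error was pasting i (a whole data frame) into the new name. Iterating over names instead is a common fix; a sketch, assuming john, jane, and bcm as in the question:
x <- list(john = john, jane = jane)
for (nm in names(x)) {
  test <- rbind(bcm, x[[nm]])
  dup <- duplicated(test$Full.Name) | duplicated(test$Full.Name, fromLast = TRUE)
  assign(paste("dupl", nm, sep = "."), test[dup, ])
}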

Storing data within R function

I'm writing a function in R that plots some data submitted by the user. The plot area has some polygons defined by a data frame that is constant and does not depend on the submitted data. The data frame is read from a csv file that has 26 rows and 13 columns.
To make the R file as portable as possible, I decided to keep the data frame within the file. As there are quite a lot of columns, I came up with the following idea:
csv_data <- c(
"h1,h2,h3
v11,v21,v31
v12,v22,v32
v13,v23,v33"
)
write(csv_data, file="temp.csv")
df <- read.csv("temp.csv",header=T)
OK, I know this is kind of disgusting, but I don't want to reorganize the original csv to build the data frame in the conventional way, as the dataset is quite big:
h1 <- c(v11, v12, v13)
h2 <- c(v21, v22, v23)
h3 <- c(v31, v32, v33)
df <- data.frame(h1,h2,h3)
So, is there any more appropriate way to achieve this? Thank you very much!
If you want to make a data.frame from a string of csv text, how about
df <- read.csv(text = csv_data, header = TRUE)
At least that way you don't need the temporary file and the write() call.
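For example, with made-up values in place of the v11 ... v33 placeholders, the whole thing becomes self-contained:
csv_data <- "h1,h2,h3
1,4,7
2,5,8
3,6,9"
df <- read.csv(text = csv_data, header = TRUE)
str(df)  # 'data.frame': 3 obs. of 3 variables, no temporary file involved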

Aggregate Function with Variable By List

I'm trying to create an R Script to summarize measures in a data frame. I'd like it to react dynamically to changes in the structure of the data frame. For example, I have the following block.
library(plyr) #loading plyr just to access baseball data frame
MyData <- baseball[,cbind("id","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,"id"]), FUN=sum)
This block creates a data frame (AggHits) with the total hits (h) for each player (id). Yay.
Suppose I want to bring in the team. How do I change the by argument so that AggHits has the total hits for each combination of "id" and "team"? I tried the following, and the second line throws the error "arguments must have same length":
MyData <- baseball[,cbind("id","team","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind("id","team")]), FUN=sum)
More generally, I'd like to write the second line so that it automatically aggregates h by all variables except h. I can generate the list of variables to group by pretty easily using setdiff.
# set the list of variables to summarize by as everything except hits
SumOver <- setdiff(colnames(MyData),"h")
# total up all the hits - again this line throws an error
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind(SumOver)]), FUN=sum)
The business purpose I'm using this for involves a csv file which has a single measure ($) and currently has about a half dozen dimensions (product, customer, state code, dates, etc.). I'd like to be able to add dimensions to the csv file without having to edit the script each time.
I should mention that I've been able to accomplish this using ddply, but I know that using ddply to summarize a single measure is wasteful with regard to run time; aggregate is much faster.
Thanks in advance!
ANSWER (specific to the example in the question)
The block should be:
MyData <- baseball[,cbind("id","team","h")]
SumOver <- setdiff(colnames(MyData),"h")
AggHits <- aggregate(x=MyData$h, by=MyData[SumOver], FUN=sum)
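This works because single-bracket indexing, MyData[SumOver], returns a data frame, which is a list with one element per column, exactly what the by argument expects. The original call failed because list(MyData[, c("id","team")]) wraps the whole two-column data frame into a single list element whose length (2, its column count) cannot match the number of rows. A quick illustration:
length(MyData[c("id", "team")])           # 2: one grouping vector per column
length(list(MyData[, c("id", "team")]))   # 1: a single element holding the whole data frame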
This aggregates by every non-integer column (ID, Team, League), but more generically shows a strategy to aggregate over an arbitrary list of columns (by=MyData[cols.to.group.on]):
MyData <- plyr::baseball
cols <- names(MyData)[sapply(MyData, class) != "integer"]
aggregate(MyData$h, by=MyData[cols], sum)
Here is a solution using aggregate from base R; the formula h ~ . aggregates h by every other column, so dimensions added to the csv are picked up without editing the script:
data(baseball, package = "plyr")
MyData <- baseball[,c("id","h", "team")]
AggHits <- aggregate(h ~ ., data = MyData, sum)
