Edit: providing further clarification on the requirement.
I'm fairly new to R and I've hit a road block while tidying up my data.
My current data looks like this.
Data
1 AAA TEXT Here
2 ZX
3 YX
4 ****
5 BBB Text Here
6 AL
7 TP
8 XY
9 ******
10 CCC Text Here
11 PP
12 QV
13 ******
AAA, BBB, and CCC act as my 'identifiers', and the *** marks the end of the lines related to each identifier. In this sample, I only want to extract BBB and the next 3 lines after it. That is, I need to select the in-between rows and transform my table to just this:
Data
1 BBB Text Here
2 AL
3 TP
4 XY
Can you please help? Thanks!
Hmm. Your method of data storage is not what any of us would recommend, but if what you have written is indeed how your data are stored, then you can use the method outlined in this answer to find the row number of the line matching your specified identifier.
# Set up test 'identifier' value
WantedIdentifier <- "BBB Text Here"

# Get the matching row number
RowNo <- which(Text == WantedIdentifier, arr.ind = TRUE)[1]

# Return from that row to the third row beyond it
ReturnedText <- if (!is.na(RowNo)) data.frame(Data = Text[RowNo:(RowNo + 3), ]) else NA
# Value returned
> ReturnedText
Data
1 BBB Text Here
2 AL
3 TP
4 XY
Test data setup
Text <- read.table(text = "Data
'AAA TEXT Here'
'ZX'
'YX'
'****'
'BBB Text Here'
'AL'
'TP'
'XY'
'******'
'CCC Text Here'
'PP'
'QV'
'******'", header = TRUE, stringsAsFactors = FALSE)
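If the number of lines under an identifier varies, you need not hard-code the three rows. Here is a sketch of that generalisation, assuming (as in your sample) that every block is terminated by a line made up only of asterisks:

# Find the end of the block dynamically instead of assuming 3 lines
RowNo <- which(Text$Data == WantedIdentifier)[1]
if (!is.na(RowNo)) {
  Ends <- which(grepl("^\\*+$", Text$Data))  # rows consisting only of asterisks
  EndRow <- min(Ends[Ends > RowNo]) - 1      # last row before the next terminator
  ReturnedText <- data.frame(Data = Text$Data[RowNo:EndRow])
}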
I have a data frame of names (df) as follows.
ID name
1 Xiaoao
2 Yukata
3 Kim
4 ...
Examples of the API output look like this.
European-SouthSlavs,0.2244 Muslim-Pakistanis-Bangladesh,0.0000 European-Italian-Italy,0.0061 ...
I would like to add new columns using an API that returns nationality scores for up to 39 nationalities, and to list up to the top 3 scores per name. My desired outcome is as follows.
ID name score nat
1 Xiaoao 0.7361 Chinese
1 Xiaoao 0.1721 Korean
1 Xiaoao 0.0721 Japanese
2 Yukata 0.8121 Japanese
2 Yukata 0.0811 Chinese
2 Yukata 0.0122 Korean
3 Kim 0.6532 Korean
3 Kim 0.2182 Chinese
3 Kim 0.0981 Japanese
4 ... ... ...
Below is some of my scratch work to get it done, but I failed to get the desired outcome due to a number of errors.
df_result <- purrr::map_dfr(df$name, function(name) {
result <- GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
"API TOKEN","/", URLencode(df$name)))
if(http_error(result)){
NULL
}else{
nat<- content(result, "text")
nat<- do.call(rbind, strsplit(strsplit(nat, split = "(?<=\\d)\n", perl=T)[[1]],","))
#first three nationalities
top_nat <- nat[order(as.numeric(nat[,2]), decreasing = T)[1:3],]
c(df$name,as.vector(t(top_nat)))
}
})
First, the top scores were computed over the entire data rather than per name.
Second, I faced an error saying "Error in dplyr::bind_rows(): ! Argument 1 must have names."
Any comments on my code would be appreciated!
Thank you in advance.
Each iteration of map_dfr should return a data frame; map_dfr then binds those together by row:
library(tidyverse)
library(httr)
df <- data.frame(name = c("Xiaoao", "Yukata", "Kim"))
map_dfr(df$name, function(name) {
  data.frame(name = name, score = sample(1:10, 1))
})
Instead of concatenating name with top_nat at the end of your function, you should be making it a data.frame!
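For illustration, here is a sketch of the corrected function (untested, since it needs a live token; "API TOKEN" stays as your placeholder). The key changes: use the name argument rather than df$name, and return a data frame per iteration so the rows can be bound:

library(httr)
library(purrr)

df_result <- map2_dfr(df$ID, df$name, function(id, name) {
  result <- GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
                       "API TOKEN", "/", URLencode(name)))
  if (http_error(result)) return(NULL)  # skipped names are dropped by the row-bind

  nat <- content(result, "text")
  nat <- do.call(rbind,
                 strsplit(strsplit(nat, split = "(?<=\\d)\n", perl = TRUE)[[1]], ","))

  # the three highest-scoring nationalities for this one name
  top_nat <- nat[order(as.numeric(nat[, 2]), decreasing = TRUE)[1:3], ]

  data.frame(ID = id, name = name,
             score = as.numeric(top_nat[, 2]),
             nat = top_nat[, 1])
})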
I'm currently having an issue importing a data set of tweets so that every observation ends up in one column.
This is the data before import; there are three lines for each tweet, with a blank line in between.
T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz
library(tidyverse)
tweets1 <- read_csv("tweets.txt.gz", col_names = F,
skip_empty_rows = F)
This is the output:
Parsed with column specification:
cols(
X1 = col_character()
)
Warning message:
“71299 parsing failures.
row col expected actual file
35 -- 1 columns 2 columns 'tweets.txt.gz'
43 -- 1 columns 2 columns 'tweets.txt.gz'
59 -- 1 columns 2 columns 'tweets.txt.gz'
71 -- 1 columns 5 columns 'tweets.txt.gz'
107 -- 1 columns 3 columns 'tweets.txt.gz'
... ... ......... ......... ...............
See problems(...) for more details.
”
# A tibble: 1,220,233 x 1
X1
<chr>
1 "T\t2009-06-11 00:00:03"
2 "U\thttp://twitter.com/imdb"
3 "W\tNo Post Title"
4 NA
5 "T\t2009-06-11 16:37:14"
6 "U\thttp://twitter.com/ncruralhealth"
7 "W\tNo Post Title"
8 NA
9 "T\t2009-06-11 16:56:23"
10 "U\thttp://twitter.com/boydjones"
# … with 1,220,223 more rows
The only issue is the many parsing failures: problems(tweets1) shows that R expected one column but got multiple. Any ideas on how to fix this? According to my professor the output should have 1.4 million rows, so I'm unsure whether this parsing issue is the cause. Any help is appreciated!
Maybe something like this will work for you.
data
data <- 'T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz'
For a large file, fread() should be quick. sep = NULL tells it to read each full line as a single column rather than splitting into fields. For your file, replace input = data with file = "tweets.txt.gz".
library(data.table)
read_rows <- fread(input = data, header = FALSE, sep = NULL, blank.lines.skip = TRUE)
processing
You could just stay with data.table, but I noticed you are already in the tidyverse.
library(dplyr)
library(stringr)
library(tidyr)
Basically I grab the first character (T, U, W) into a variable called Column, store the rest of the string, with whitespace trimmed on both ends, in another column called Content, and add an ID column so I know how to group the clusters of 3 rows.
Then you basically just pivot on Column. I am not sure if you wanted this last step or not, so remove it as needed.
read_rows %>%
  mutate(ID = rep(seq_len(n() / 3), each = 3),
         Column = str_sub(V1, 1, 1),
         Content = str_trim(str_sub(V1, 2))) %>%
  select(-V1) %>%
  pivot_wider(names_from = Column, values_from = Content)
result
# A tibble: 3 x 4
ID T U W
<int> <chr> <chr> <chr>
1 1 2009-06-11 00:00:03 http://twitter.com/imdb No Post Title
2 2 2009-06-11 16:37:14 http://twitter.com/ncruralhealth No Post Title
3 3 2009-06-11 16:56:23 http://twitter.com/boydjones "listening to \"Big Lizard - The Dead Milkmen\" ♫ http://blip.fm/~81kwz"
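For completeness, here is a sketch of the same processing done entirely in data.table, reusing read_rows from above:

library(data.table)

# tag each line, then cast the clusters of 3 rows wide
read_rows[, `:=`(ID = rep(seq_len(.N / 3), each = 3),
                 Column = substr(V1, 1, 1),
                 Content = trimws(substr(V1, 2, nchar(V1))))]
dcast(read_rows, ID ~ Column, value.var = "Content")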
I have a .csv file with the following type of data:
Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47,
And I can't get read.transactions() to read it properly.
The data set is based on several item selections per day (more than one per day, if necessary). For instance, the third selection on day 1 returned items 16, 28, 32, and 45.
Shouldn't this be enough?
library(arules)
dataset <- read.transactions("file.csv", format = 'basket')
I have tried to create sample data using the data provided by you:
data <- read.table(text="Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47",header = T)
data <- as(strsplit(as.character(data$Item), ","), "transactions") ## drop the Day column and split each comma-separated basket into items
inspect(data)
## apply apriori algorithm ###
rules <- apriori(data, parameter = list(supp = 0.001, conf = 0.80))
### Inspect the top 10 rules, sorted by lift ####
inspect(head(sort(rules, by = "lift"), 10))
Please try this method; I hope it helps.
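As to why read.transactions("file.csv", format = 'basket') alone isn't enough: the file mixes a whitespace-separated Day column with comma-separated items, so no single sep can parse it. A sketch that strips the Day field first, assuming the file is laid out exactly as shown:

library(arules)

lines <- readLines("file.csv")[-1]    # drop the "Day Item" header line
items <- sub("^\\S+\\s+", "", lines)  # remove the leading Day field
trans <- as(strsplit(items, ","), "transactions")
inspect(trans)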
I am using the sunburstR package to create a sunburst diagram, but it is not working and I am not sure what I am doing wrong.
Raw data:
> sequences
V1
1 A-aa-aaa-end
2 A-aa-aaa-end
3 A-aa-vvv-end
4 A-aa-vvv-end
5 A-cc-vvv-end
6 A-cc-vvv-end
7 B-aa-vvv-end
8 B-aa-vvv-end
9 B-bb-rr-end
10 B-bb-rr-end
11 C-aa-rr-end
12 C-aa-rr-end
13 C-bb-rr-end
14 C-bb-rr-end
15 C-cc-rr-end
Code:
sequences <- read.csv(filepath, header=F ,stringsAsFactors = FALSE)
sunburst(sequences)
You need some values in the second column of your data frame...
sequences <- read.table(text = '
A-aa-aaa-end
A-aa-aaa-end
A-aa-vvv-end
A-aa-vvv-end
A-cc-vvv-end
A-cc-vvv-end
B-aa-vvv-end
B-aa-vvv-end
B-bb-rr-end
B-bb-rr-end
C-aa-rr-end
C-aa-rr-end
C-bb-rr-end
C-bb-rr-end
C-cc-rr-end
')
sequences$V2 <- seq_along(sequences$V1)
sequences
library(sunburstR)
sunburst(sequences)
You are missing the count part. Try sunburst(data.frame(table(sequences$V1))) and it should work as expected.
PS: not tested, as I don't have your sequences data frame.
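Spelled out, that suggestion amounts to the sketch below: table() aggregates the duplicate paths into counts, giving sunburst() the two-column sequence/count shape it expects.

library(sunburstR)

counts <- as.data.frame(table(sequences$V1), stringsAsFactors = FALSE)
sunburst(counts)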
I'm looking for an easy fix to read a txt file that looks like this when opened in Excel:
IDmaster By_uspto App_date Grant_date Applicant Cited
2 1 19671106 19700707 Motorola Inc 1052446
2 1 19740909 19751028 Gen Motors Corp 1062884
2 1 19800331 19820817 Amp Incorporated 1082369
2 1 19910515 19940719 Dell Usa L.P. 389546
2 1 19940210 19950912 Schueman Transfer Inc. 1164239
2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
EDIT: Opening the txt file in Notepad shows this (with commas). The last two rows exhibit the problem.
IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited
2,1,19671106,19700707,Motorola Inc,1052446
2,1,19740909,19751028,Gen Motors Corp,1062884
2,1,19800331,19820817,Amp Incorporated,1082369
2,1,19910515,19940719,Dell Usa L.P.,389546
2,1,19940210,19950912,Schueman Transfer, Inc.,1164239
2,1,19940217,19950912,Spacelabs Medical, Inc.,1164336
The problem is that some of the Applicant names contain commas, so they are read as if they belonged in a different column, which they don't.
Is there a simple way to
a) "teach" R to keep string variables together, regardless of the commas in between, or
b) read in the first 4 columns, and then add an extra column for everything after the last comma?
Given the length of the data I can't open it entirely in Excel, which would otherwise be a simple alternative.
If your example is written in a "Test.csv" file, try with:
# collapse the file into one string and drop the ', ' that splits the names
read.csv(text = gsub(', ', ' ', paste0(readLines("Test.csv"), collapse = "\n")),
         quote = "'",
         stringsAsFactors = FALSE)
It returns:
# IDmaster By_uspto App_date Grant_date Applicant Cited
# 1 2 1 19671106 19700707 Motorola Inc 1052446
# 2 2 1 19740909 19751028 Gen Motors Corp 1062884
# 3 2 1 19800331 19820817 Amp Incorporated 1082369
# 4 2 1 19910515 19940719 Dell Usa L.P. 389546
# 5 2 1 19940210 19950912 Schueman Transfer Inc. 1164239
# 6 2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
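Note that this relies on every stray comma being followed by a space (as in ", Inc."); any legitimate ", " inside another field would be collapsed as well.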
This provides a very silly workaround, but it does the trick for me (because I don't really care about the Applicant names at the moment). However, I'm hoping for a better solution.
Step 1: Open the .txt file in Notepad and add five extra column names V1, V2, V3, V4, V5 to the header row (to be sure to capture names with multiple commas).
bc <- read.table("data.txt", header = T, na.strings = T, fill = T, sep = ",", stringsAsFactors = F)
library(data.table)
sapply(bc, class)
unique(bc$V5) # only NA, so V5 can be deleted
setDT(bc)
bc <- bc[, 1:10, with = F]
# rows with extra commas push the citation number into V1-V4; coercing
# everything to numeric turns the name fragments into NA, which are zeroed
bc$Cited <- as.numeric(bc$Cited)
bc$Cited[is.na(bc$Cited)] <- 0
bc$V1 <- as.numeric(bc$V1)
bc$V2 <- as.numeric(bc$V2)
bc$V3 <- as.numeric(bc$V3)
bc$V4 <- as.numeric(bc$V4)
bc$V1[is.na(bc$V1)] <- 0
bc$V2[is.na(bc$V2)] <- 0
bc$V3[is.na(bc$V3)] <- 0
bc$V4[is.na(bc$V4)] <- 0
head(bc, 10)
# whichever column the number landed in, the sum recovers it
bc$Cited <- with(bc, Cited + V1 + V2 + V3 + V4)
It's a silly patch, but it does the trick in this particular context.
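For reference, here is a sketch along the lines of option (b) that needs no manual editing: since only the Applicant field can contain commas, a regex can pin down the four leading fields and the trailing Cited field, and capture everything in between (assuming every line has that shape):

# Capture the 4 leading fields, the trailing Cited field, and
# everything in between as Applicant (internal commas and all)
lines <- readLines("data.txt")[-1]  # drop the header line
pattern <- "^([^,]+),([^,]+),([^,]+),([^,]+),(.*),([^,]+)$"
bc <- data.frame(
  IDmaster   = as.numeric(sub(pattern, "\\1", lines)),
  By_uspto   = as.numeric(sub(pattern, "\\2", lines)),
  App_date   = as.numeric(sub(pattern, "\\3", lines)),
  Grant_date = as.numeric(sub(pattern, "\\4", lines)),
  Applicant  = sub(pattern, "\\5", lines),
  Cited      = as.numeric(sub(pattern, "\\6", lines)),
  stringsAsFactors = FALSE
)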