Trouble reading a messy CSV file in R


I have been trying to read a CSV file into R. The file is laid out oddly: all the values sit in a single column, separated by commas. The top row holds the column names and the rows below hold the values.
When I try read_csv("filename"), the resulting tibble shows nothing but NA values when I inspect it with the view function. How can I approach this?
Here is the data for reference:
, Calories, Fat (g), Carb. (g), Fiber (g), Protein (g)
Chonga Bagel,300,5,50,3,12
8-Grain Roll,380,6,70,7,10
Almond Croissant,410,22,45,3,10
Apple Fritter,460,23,56,2,7
Banana Nut Bread,420,22,52,2,6
Blueberry Muffin with Yogurt and Honey,380,16,53,1,6
Blueberry Scone,420,17,61,2,5
Butter Croissant,240,12,28,1,5
Butterfly Cookie,350,22,38,0,2
Cheese Danish,320,16,36,1,8
Chewy Chocolate Cookie,170,5,30,2,2
Chocolate Chip Cookie,310,15,42,2,4
Chocolate Chunk Muffin,440,21,60,2,7
Chocolate Croissant,330,18,38,1,6
Chocolate Hazelnut Croissant,390,22,43,2,7
Chocolate Marble Loaf Cake,490,24,64,2,6
Cinnamon Morning Bun,390,15,56,2,8
Cinnamon Raisin Bagel,270,1,58,3,9
Classic Coffee Cake,390,16,57,1,5
Cookie Butter Bar,360,23,36,0,2

Use the following code to read the data:
df <- read.csv("starbucks-menu-nutrition-food.csv", skipNul = TRUE)
head(df, 2)
ÿþ Calories Fat..g. Carb...g. Fiber..g. Protein..g.
1 Chonga Bagel 300 5 50 3 12
2 8-Grain Roll 380 6 70 7 10
Then you may consider renaming the columns, for example:
colnames(df) <- c("Food", "Calories", "Fat", "Carb", "Fiber", "Protein")
for further processing of the data.
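
The ÿþ in the first column name above is what the bytes FF FE (a UTF-16 little-endian byte-order mark) look like when read as Latin-1, so the file is probably UTF-16 encoded. If that assumption holds, decoding the file while reading may be cleaner than skipping NUL bytes. The sketch below builds a small stand-in file so it can run on its own; the three-column sample and the temporary file are illustrative, not the original data set.

```r
# Build a tiny UTF-16LE file with a byte-order mark (stand-in for the real file).
txt <- paste0(
  paste(c(",Calories,Fat (g)",
          "Chonga Bagel,300,5",
          "8-Grain Roll,380,6"), collapse = "\n"),
  "\n"
)
bom  <- as.raw(c(0xff, 0xfe))
body <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
tmp  <- tempfile(fileext = ".csv")
writeBin(c(bom, body), tmp)

# Decode the file while reading instead of skipping NUL bytes.
df <- read.csv(tmp, fileEncoding = "UTF-16LE")

# Rename the columns, as suggested in the answer (shortened to three here).
colnames(df) <- c("Food", "Calories", "Fat")
str(df)
```

With readr, the equivalent would be passing an encoding through locale(), which is likewise only worth trying if the file really is UTF-16.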

Related

Making new data frame in R

Say I have a data table with three columns: apple, orange, and age.
What code can I write in R to make another table with upper-case columns FRUITS, AGE, and USE?
Current data:
apple  orange  age
3      2       1-3
4      5       4-6
8      9       7-9
Desired result:
FRUITS  AGE  USE
apple   1-3  3
apple   4-6  4
apple   7-9  8
orange  1-3  2
orange  4-6  5
orange  7-9  9
This is an example, so I give only a few values, but let's say my data has 30 rows like that. I do not want to add each row to a new data frame manually. How can I turn the apple and orange columns into FRUITS and create a USE column?

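Using the pivot_longer() function from tidyr, as moodymudskipper suggested, I think it would be coded like this:
library(dplyr)  # provides %>% and rename_with()
library(tidyr)  # provides pivot_longer()
new_data <- data %>%
  pivot_longer(!age, names_to = "FRUITS", values_to = "USE") %>%
  rename_with(toupper)  # upper-case the remaining column name (age -> AGE)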

One possible way to solve your problem:
library(data.table)
melt(as.data.table(df),
     measure.vars = c("apple", "orange"),
     variable.name = "FRUITS",
     value.name = "USE")
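
For comparison, the same reshaping can be sketched in base R without any packages. This is not taken from either answer; the data frame df is rebuilt here from the sample values in the question, and fruit_cols is just an illustrative helper name.

```r
# Sample data from the question.
df <- data.frame(
  apple  = c(3, 4, 8),
  orange = c(2, 5, 9),
  age    = c("1-3", "4-6", "7-9"),
  stringsAsFactors = FALSE
)

# Every column except 'age' is treated as a fruit column.
fruit_cols <- setdiff(names(df), "age")

# Stack the fruit columns on top of each other, repeating 'age' for each one.
long <- data.frame(
  FRUITS = rep(fruit_cols, each = nrow(df)),
  AGE    = rep(df$age, times = length(fruit_cols)),
  USE    = unlist(df[fruit_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```

Because fruit_cols is derived from the column names, the same code should carry over to 30 rows or to extra fruit columns without changes.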

R | update column in dataframe based on conditions in other dataframe

I am having trouble finding a way to cleanly update the amount column in table 1 with the price column in table 2. I know that left_join or merge could be used to join the price column, rename it, and then drop the leftover column, but I am wondering if there is a simpler way that avoids creating a mess.
I should state that the real dataset is more complicated and that the amount column in table 1 needs to be conditionally updated somehow based on table 2.
Table 1
Fruit      Vegetable  amount
apple      broccoli
pear       spinach
pineapple  carrot
Table 2
Fruit      Vegetable  price
apple      broccoli   10
pear       spinach    5
pineapple  carrot     2

If you don't want to merge and then update, you can use match:
table1$amount <- table2$price[match(paste(table1$Fruit, table1$Vegetable),
                                    paste(table2$Fruit, table2$Vegetable))]
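
As a rough, self-contained illustration of the match approach, the sketch below rebuilds both tables from the question (with table 2 deliberately in a different row order) and only overwrites amount where a matching row was found. The found mask is an addition that is not in the answer; it is one way to make the update conditional, so rows without a match keep their existing amount.

```r
# Tables from the question; table2 is shuffled to show that row order does not matter.
table1 <- data.frame(
  Fruit     = c("apple", "pear", "pineapple"),
  Vegetable = c("broccoli", "spinach", "carrot"),
  amount    = NA_real_,
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  Fruit     = c("pineapple", "apple", "pear"),
  Vegetable = c("carrot", "broccoli", "spinach"),
  price     = c(2, 10, 5),
  stringsAsFactors = FALSE
)

# Build one lookup key per row from the two identifying columns.
key1 <- paste(table1$Fruit, table1$Vegetable)
key2 <- paste(table2$Fruit, table2$Vegetable)

# Position of each table1 row in table2 (NA when there is no match).
idx   <- match(key1, key2)
found <- !is.na(idx)

# Update only the rows that have a counterpart in table2.
table1$amount[found] <- table2$price[idx[found]]
table1
```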

R - Converting table to a list for export to an csv line by line for each cell

I am working on a solution for exporting a table in R as a tab-separated CSV file, where each cell in the table becomes a row in the export file. I searched for similar problems, but only found solutions for reading data into a table, not for exporting it with the necessary structure.
I already have the code that reads in the data and generates a table in which each string is separated and a variable is defined for each word in the string.
What I have not managed to write is the code that generates the desired export structure.
The desired output should look like this for the example below:
ID0001 Butter
ID0001 Jelly
ID0001 Peanut
ID0002 Peanut
ID0003 Butter
ID0004 Storm
ID0005 Butter
ID0005 Storm
ID0005 Wind
I need exactly the structure shown above, because the file is processed further in another application.
I hope someone can help me with the code. I am thankful for any advice (sample code, packages, etc.).
#Generate Test data
df_TEST <- structure(list(ID = c("ID0001", "ID0002", "ID0003", "ID0004",
"ID0005"), strings = c("Peanut Butter Jelly", "Peanut", "Butter",
"Storm", "Storm Wind Butter")), class = "data.frame", row.names = c(NA, -5L))
#Get all unique words
all_words <- sort(unique(unlist(strsplit(df_TEST$strings, "\\W"))))
#Generate table
df_Result<-cbind(df_TEST, sapply(all_words, function(x) 1 * grepl(x, df_TEST$strings)))
#MISSING: Generate structure for output
#Code for tab-separated csv export (done)

Maybe you can try the base R code below
word_list <- Map(unique,strsplit(df_TEST$strings,"\\W"))
dfout <- data.frame(ID = rep(df_TEST$ID,lengths(word_list)),
Word = unlist(word_list))
such that
> dfout
ID Word
1 ID0001 Peanut
2 ID0001 Butter
3 ID0001 Jelly
4 ID0002 Peanut
5 ID0003 Butter
6 ID0004 Storm
7 ID0005 Storm
8 ID0005 Wind
9 ID0005 Butter
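
The result above lists the words in their original order, whereas the desired output in the question is sorted by ID and then by word, and the export step itself is not shown. A minimal sketch of both steps follows, assuming a temporary file as the export target (out_file is a placeholder name) and a plain tab-separated layout without header or quotes.

```r
# Test data from the question.
df_TEST <- data.frame(
  ID      = c("ID0001", "ID0002", "ID0003", "ID0004", "ID0005"),
  strings = c("Peanut Butter Jelly", "Peanut", "Butter", "Storm", "Storm Wind Butter"),
  stringsAsFactors = FALSE
)

# One row per (ID, word) pair, as in the answer.
word_list <- lapply(strsplit(df_TEST$strings, "\\W+"), unique)
dfout <- data.frame(
  ID   = rep(df_TEST$ID, lengths(word_list)),
  Word = unlist(word_list),
  stringsAsFactors = FALSE
)

# Sort by ID, then alphabetically by word, to match the desired output.
dfout <- dfout[order(dfout$ID, dfout$Word), ]

# Write a tab-separated file without header, row names, or quotes.
out_file <- tempfile(fileext = ".csv")
write.table(dfout, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
readLines(out_file)
```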

Data Cleaning in R

I have a CSV file from which I want to extract only the timestamps of the lines that contain "toward", plus the fruit name in each of those lines. How can I do this in R (or, if there is a faster way, what is it)?
rosbagTimestamp,data
1438293900729698553,robot is in motion toward [strawberry]
1438293900730571638,Found a plan for avocado in 1.36400008202 seconds
1438293900731434815,current probability is greater than EXECUTION_THRESHOLD
1438293900731554567,ready to execute am original plan of len = 33
1438293900731586463,len of sub plan 1 = 24
1438293900731633713,len of sub plan 2 = 9
1438293900732910799,put in an execution request; now updating the dict
1438293900732949576,current_prediciton_item = avocado
1438293900733070339,current_item_probability = 0.880086981207
1438293901677787230,current probability is greater than PLANNING_THRESHOLD
1438293901681590725,robot is in motion toward [avocado]
1438293902689233770,we have received verbal request [avocado]
1438293902689314002,we already have a plan for the verbal request
1438293902689377800,debug
1438293902690529516,put in the final motion request
1438293902691076051,Found a plan for avocado in 1.95595788956 seconds
1438293902691084147,current predicted item != motion target; calc a new plan
1438293902691110642,current probability is greater than EXECUTION_THRESHOLD
1438293902691885974,have existing requests
1438293904496769068,robot is in motion toward [avocado]
1438293907737142498,ready to pick up the item
Ideally I want the output to be something like this:
1438293900729698553, strawberry
1438293901681590725, avocado
1438293904496769068, avocado
Apparently I have to use subset with grep in R, but I am not really sure how!

stamps <- df$rosbagTimestamp[grep("toward \\[", df$data)]
fruits <- gsub(".*\\[(\\w+)\\].*", "\\1", df$data[grep("toward \\[", df$data)])
data.frame(stamps,fruits)
stamps fruits
1 1438293900729698560 strawberry
2 1438293901681590784 avocado
3 1438293904496769024 avocado
I used the pattern "toward \\[" to locate the lines that mention a fruit; if the wording of those lines varies, the pattern can be extended. The stamps variable holds the timestamps of the rows whose data column matches the pattern. The fruits variable isolates the fruit name inside the brackets.
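
Note that the printed timestamps differ from the input in their last digits (for example 1438293900729698560 instead of 1438293900729698553). These 19-digit values are larger than 2^53, so they cannot be stored exactly when the column is read as a number. One way around this, sketched below with a few rows of the sample data passed in through the text argument, is to read both columns as character strings.

```r
# A few rows from the question, inlined so the sketch runs on its own.
csv_text <- "rosbagTimestamp,data
1438293900729698553,robot is in motion toward [strawberry]
1438293900730571638,Found a plan for avocado in 1.36400008202 seconds
1438293901681590725,robot is in motion toward [avocado]
1438293907737142498,ready to pick up the item"

# Read every column as character so the timestamps keep all their digits.
df <- read.csv(text = csv_text, colClasses = "character")

# Rows whose message contains "toward [".
hit <- grepl("toward \\[", df$data)

result <- data.frame(
  stamps = df$rosbagTimestamp[hit],
  fruits = sub(".*\\[(\\w+)\\].*", "\\1", df$data[hit]),
  stringsAsFactors = FALSE
)
result
```

If the timestamps are needed for arithmetic later on, a 64-bit integer type from an add-on package would be required; as strings they are only safe for display and export.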

What is the simplest way to recode a variable based on conditions of another variable in R?

Silly example df, "cat":
species color tail_length
calico brown 6
calico gray 6
tabby multi 5
tabby brown 5
Suppose I want to create a new variable, personality. The values here will be recoded based on tail_length, but will also be conditional upon the species and color of the cat. So the ideal final df would look like this:
species color tail_length personality
calico brown 6 mean
calico gray 6 nice
tabby multi 5 mean
tabby brown 5 nice
At present, I'm using this code:
library(car)
cat$personality <- recode(cat$tail_length, "6='mean'; 5='nice'")
cat$personality[cat$species=="calico" & cat$color=="brown"] <- "mean"
cat$personality[cat$species=="calico" & cat$color=="gray"] <- "nice"
cat$personality[cat$species=="tabby" & cat$color=="multi"] <- "mean"
cat$personality[cat$species=="tabby" & cat$color=="brown"] <- "nice"
My main question is this: is there a simpler way to do this/consolidate these functions into one?
Given that I made up this example data on the fly, please take it with a grain of salt when answering.
Thanks! As an R beginner, I really appreciate your help.

Here's one approach using qdap and qdapTools (CRAN packages that I maintain):
library(qdap); library(qdapTools)
key <- list(
mean = c( "calico.gray", "tabby.brown"),
nice = c("calico.brown", "tabby.multi")
)
dat[["personality"]] <- paste2(dat[1:2]) %l% key
dat
## species color tail_length personality
## 1 calico brown 6 nice
## 2 calico gray 6 mean
## 3 tabby multi 5 nice
## 4 tabby brown 5 mean
Basically you create a key that's a named list based on the combined columns. Then %l% acts as a hash table lookup.
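
The same lookup idea can be sketched in base R with a named character vector, which avoids the extra packages. This is an illustrative alternative rather than the answer's method; it uses the same mapping as the key above and assumes the two columns are joined with a dot.

```r
# Example data from the question.
dat <- data.frame(
  species     = c("calico", "calico", "tabby", "tabby"),
  color       = c("brown", "gray", "multi", "brown"),
  tail_length = c(6, 6, 5, 5),
  stringsAsFactors = FALSE
)

# Named vector: names are "species.color" keys, values are the personalities
# (same assignments as in the qdap key above).
key <- c(
  "calico.gray"  = "mean",
  "tabby.brown"  = "mean",
  "calico.brown" = "nice",
  "tabby.multi"  = "nice"
)

# Build the key for each row and look it up by name.
dat$personality <- unname(key[paste(dat$species, dat$color, sep = ".")])
dat
```

Combinations that are missing from key come back as NA, which makes gaps in the mapping easy to spot.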

There isn't much you can do here because, at the end of the day, you still need to specify the conditions and new variables to assign.
However, you can cut down on the boilerplate code by using within, which returns a modified copy that you assign back:
library(car)
cat <- within(cat, {
  personality <- recode(tail_length, "6='mean'; 5='nice'")
  personality[species == "calico" & color == "brown"] <- "mean"
  personality[species == "calico" & color == "gray"] <- "nice"
  personality[species == "tabby" & color == "multi"] <- "mean"
  personality[species == "tabby" & color == "brown"] <- "nice"
})

This is really just a merge operation. (Furthermore, you have overspecified the criteria, since species and tail_length are completely dependent. But since it's only an example, that may not be an issue.) Let's say your first data frame is dat and the criteria data frame is lookup. Then all you need to do is:
> merge(dat, lookup)
species color tail_length personality
1 calico brown 6 mean
2 calico gray 6 nice
3 tabby brown 5 nice
4 tabby multi 5 mean
Not a very interesting or dramatic result because it looks just like the lookup dataframe, but give it something a bit larger and:
> merge( rbind(dat,dat,dat) , lookup)
species color tail_length personality
1 calico brown 6 mean
2 calico brown 6 mean
3 calico brown 6 mean
4 calico gray 6 nice
5 calico gray 6 nice
6 calico gray 6 nice
7 tabby brown 5 nice
8 tabby brown 5 nice
9 tabby brown 5 nice
10 tabby multi 5 mean
11 tabby multi 5 mean
12 tabby multi 5 mean
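
The answer does not show how dat and lookup were created. A minimal sketch follows, assuming lookup holds one row per combination with the same three columns as dat plus personality, which is consistent with the remark that the merged result looks just like the lookup data frame.

```r
# Data from the question.
dat <- data.frame(
  species     = c("calico", "calico", "tabby", "tabby"),
  color       = c("brown", "gray", "multi", "brown"),
  tail_length = c(6, 6, 5, 5),
  stringsAsFactors = FALSE
)

# Assumed lookup table: one row per species/color/tail_length combination.
lookup <- data.frame(
  species     = c("calico", "calico", "tabby", "tabby"),
  color       = c("brown", "gray", "multi", "brown"),
  tail_length = c(6, 6, 5, 5),
  personality = c("mean", "nice", "mean", "nice"),
  stringsAsFactors = FALSE
)

# merge() joins on all columns the two data frames share.
result <- merge(dat, lookup)
result

# With repeated rows, each one picks up its personality from lookup.
bigger <- merge(rbind(dat, dat, dat), lookup)
nrow(bigger)
```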
