Read CSV in R with first column as dataframe header

I have a simple text file where the first column is names (strings) and the second column is values (floats). As an example, names and ages:
Name, Age
John, 32
Heather, 46,
Jake, 23
Sally, 19
I'd like to read this in as a dataframe (call this df) but transposed so that I can access ages by names such that df$John would return 32. How can I do this?
Previously I tried creating a new dataframe, tdf, looping through the data in a for loop and inserting each name and age into the empty dataframe as tdf[name] = age, but this did not work as I expected.

You can read your data using read.table().
Then you can transpose it using t() and set colnames after.
Example:
If df is:
df=read.table("dummydata", header=T, sep=",")
df
Name Age
1 John 32
2 Heather 46
3 Jake 23
4 Sally 19
You transpose the age and then transform them into a dataframe:
tdf=as.data.frame(t(df$Age))
colnames(tdf)=t(df$Name)
So tdf will return:
tdf
John Heather Jake Sally
1 32 46 23 19
And, as you asked, tdf$John will return:
tdf$John
[1] 32
Now, if you have more than two columns you can do the same but instead of indicating the name of the column you can simply indicate the position using brackets.
df=read.table("dummydata", header=T, sep=",")
With t(df[2:ncol(df)]) you transpose the whole table starting from the second column, no matter the number of columns. The first column will be the names after the transpose.
tdf=as.data.frame(t(df[2:ncol(df)]))
Then you set the column names.
colnames(tdf)=t(df[1])
tdf$John
[1] 32

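This should use the first row as the header when you read from the file. Note that read.csv2 expects semicolon-separated fields with a comma as the decimal mark; for a comma-separated file such as the one in the question, use read.csv with the same arguments.
read.csv2(filename, as.is = TRUE, header = TRUE)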

Read the data into a data frame, DF (see Note).
1) Assign the names to the rows of DF in which case this will give John's age without having to create a new data structure:
rownames(DF) <- DF$Name
DF["John", "Age"]
## [1] 32
2) Alternatively, split DF into a named list in which case you can get the precise syntax requested:
ages <- with(DF, split(Age, Name))
ages$John
## [1] 32
3) This alternative would also create the same list:
ages <- with(DF, setNames(as.list(Age), Name))
Note: DF in reproducible form is as follows. (We have removed the trailing comma on one line in the question but if it is really there add fill = TRUE to the read.csv line.)
Lines <- "Name, Age
John, 32
Heather, 46
Jake, 23
Sally, 19"
DF <- read.csv(text = Lines)

A bit late but hopefully helpful. The row.names argument lets you use the desired column (here the first one) as the row names of the data frame:
read.csv("df.csv", header = TRUE, row.names = 1)
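A minimal sketch of how row.names = 1 could be combined with t() to get the df$John access asked for in the question. The inline text stands in for the CSV file, and strip.white = TRUE is an added assumption to guard against the spaces after the commas.

```r
# Inline copy of the question's data (trailing comma removed); stands in for the file
csv_text <- "Name, Age
John, 32
Heather, 46
Jake, 23
Sally, 19"

# Use the first column as row names instead of as a regular column
df <- read.csv(text = csv_text, header = TRUE, row.names = 1, strip.white = TRUE)

# Transposing turns the row names into column names
tdf <- as.data.frame(t(df))

tdf$John
```

Without the transpose, the same value is available as df["John", "Age"].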

Related

Concatenate columns in data frame

We have brand data in a column (variable) delimited by semicolons (;). Our task is to split this column into multiple columns, which we were able to do with the following syntax.
The data set was attached as a screenshot.
Here is the R code:
point <- dataset %>% separate(Pref_All, c("Pref_01","Pref_02","Pref_03","Pref_04","Pref_05"), sep = ";")
point[is.na(point)] <- ""
However, we have this kind of brand data in 10 to 15 columns. With the syntax above, the number of columns to split into has to be decided from the number of brands each column holds, which we counted manually (5 here).
Is there a way to write the code dynamically, so that it calculates the maximum number of brands each column holds and creates that many new columns in the data frame? For example:
Pref_01,Pref_02,Pref_03,Pref_04,Pref_05.
The preferred output was given as a screenshot.
Thanks in advance for the help.

x <- c("Swift;Baleno;Ciaz;Scross;Brezza", "Baleno;swift;celerio;ignis", "Scross;Baleno;celerio;brezza", "", "Ciaz;Scross;Brezza")
strsplit(x,";")

library(dplyr)
library(tidyr)
x <- data.frame(ID = c(1,2,3,4,5),
Pref_All = c("S;B;C;S;B",
"B;S;C;I",
"S;B;C;B",
" ",
"C;S;B"))
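# as.character() works whether Pref_All was read as a factor or as character
x$Pref_All <- as.character(x$Pref_All)
# Number of brands in each row; the largest count sets the number of new columns
b <- lengths(strsplit(x$Pref_All, ";"))
final_df <- x %>%
  tidyr::separate(Pref_All, paste0("Pref_0", 1:max(b)), sep = ";", fill = "right")
final_df$ID <- x$Pref_All
final_df <- rename(final_df, Pref_All = ID)
final_df[is.na(final_df)] <- ""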
   Pref_All Pref_01 Pref_02 Pref_03 Pref_04 Pref_05
1 S;B;C;S;B       S       B       C       S       B
2   B;S;C;I       B       S       C       I
3   S;B;C;B       S       B       C       B
4
5     C;S;B       C       S       B
The trick for the column names is given by paste0 going from 1 to the maximum number of brands in your data!

I would use str_split() which returns a list of character vectors. From that, we can work out the max number of preferences in the dataframe and then apply over it a function to add the missing elements.
library(stringr)  # provides str_split()
df=data.frame("id"=1:5,
              "Pref_All"=c("brand1", "brand1;brand2;brand3", "", "brand2;brand4", "brand5"))
spl = str_split(df$Pref_All, ";")
# Find the max number of preferences
maxl = max(unlist(lapply(spl, length)))
# Add missing values to each element of the list
spl = lapply(spl, function(x){c(x, rep("", maxl-length(x)))})
# Bind each element of the list in a data.frame
dfr = data.frame(do.call(rbind, spl))
# Rename the columns
names(dfr) = paste0("Pref_", 1:maxl)
print(dfr)
# Pref_1 Pref_2 Pref_3
#1 brand1
#2 brand1 brand2 brand3
#3
#4 brand2 brand4
#5 brand5
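Since the question mentions 10 to 15 such columns, the same idea can be wrapped in a function and applied to each column in turn. This is only a sketch in base R: split_column, the Owned_All column and the sample values are made up for illustration, and each column gets as many new columns as its own longest entry needs.

```r
# Hypothetical helper: split one delimited column into as many columns as needed
split_column <- function(values, prefix, sep = ";") {
  parts <- strsplit(as.character(values), sep, fixed = TRUE)
  # Widest entry decides the number of new columns (at least one)
  n_max <- max(lengths(parts), 1L)
  # Pad shorter entries with empty strings so every row has n_max pieces
  padded <- lapply(parts, function(p) c(p, rep("", n_max - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- sprintf("%s_%02d", prefix, seq_len(n_max))
  out
}

# Made-up data with two delimited columns
dataset <- data.frame(
  Pref_All  = c("Swift;Baleno;Ciaz", "Baleno", "", "Ciaz;Scross"),
  Owned_All = c("Swift", "Baleno;Ignis", "Ciaz", ""),
  stringsAsFactors = FALSE
)

# Names are the prefixes for the new columns, values are the columns to split
to_split <- c(Pref = "Pref_All", Owned = "Owned_All")
pieces <- Map(split_column, dataset[to_split], names(to_split))
result <- do.call(cbind, unname(pieces))
names(result)
```

Here Pref_All yields three new columns and Owned_All yields two, without either count being hard-coded.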

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that was converted from a document-term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag" that differentiates the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.

Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"

If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"

You can use setnames from the data.table package; it renames the columns by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
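Applied to the two cases in the question, the paste0 approach might look like the sketch below. The pos and neg data frames are tiny made-up stand-ins for the positive and negative document-term data.

```r
# The mtcars example from the question: append "_sample" to every column name
tagged <- mtcars
names(tagged) <- paste0(names(tagged), "_sample")
head(names(tagged), 3)

# Made-up stand-ins for the positive and negative term counts
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Tag each set before combining, so the two "hello" columns stay distinct
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
features <- cbind(pos, neg)
names(features)
```

No sapply or lapply is needed because paste0 is already vectorised over the names.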

Parsing a string efficiently

So I've got a column in my data frame that is essentially one long character string encoding a set of variables for each record. It might look something like this:
string<-c('001034002025003996','001934002199004888')
But much longer.
The strings are structured so each 6 characters are paired together. So you can look at the string above like this:
001034 002025 003996
001934 002199 004888
The first three characters of each group are a code identifying a variable and the next three are the value of that variable. So the above can be broken down into four columns that look like this:
  var001 var002 var003 var004
1    034    025    996     NA
2    934    199     NA    888
I need a way to parse this string and return a data frame with the expanded columns.
I wrote a nested loop that looks like this:
for(i in 1:length(string)){
  text <- string[i]
  for(j in seq(1,505,6)){
    var <- substr(text,j, j+2)
    var.value <- substr(text, j+3, j+5)
    index <- (as.numeric(var))
    df[i, index] <- var.value
  }
}
where df is an empty data frame created to receive the data. This works, but is slow on larger amounts of data. Is there a better way to do this?

1) This one-liner produces a character matrix (which can easily be converted to a data.frame if need be). No packages are used.
read.dcf(textConnection(gsub("(...)(...)", "\\1: \\2\n", string)))
giving:
001 002 003 004
[1,] "034" "025" "996" NA
[2,] "934" "199" NA "888"
2) This alternative produces the same matrix. The read.table produces a long form data.frame and then tapply reshapes it to a wide matrix.
long <- read.table(text = gsub("(...)(...)", "\\1 \\2\n", string),
colClasses = "character", col.names = c("id", "var"))
tapply(long$var, list(gl(length(string), nchar(string[1])/6), long$id), c)
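Another way to avoid the nested loop, sketched here in base R, is to extract all six-character groups at once and fill a matrix with a single indexed assignment. The function name parse_pairs is made up for this example, and it assumes every string is a whole number of six-character groups.

```r
string <- c("001034002025003996", "001934002199004888")

# Hypothetical helper: expand code/value pairs into one column per code
parse_pairs <- function(string) {
  # All 6-character groups, one character vector per record
  chunks <- regmatches(string, gregexpr(".{6}", string))
  row_id <- rep(seq_along(chunks), lengths(chunks))
  flat   <- unlist(chunks)
  code   <- substr(flat, 1, 3)
  value  <- substr(flat, 4, 6)
  codes  <- sort(unique(code))
  # Start with all NA, then fill every (record, code) cell in one assignment
  out <- matrix(NA_character_, nrow = length(string), ncol = length(codes),
                dimnames = list(NULL, paste0("var", codes)))
  out[cbind(row_id, match(code, codes))] <- value
  as.data.frame(out, stringsAsFactors = FALSE)
}

res <- parse_pairs(string)
res
```

The values stay as character strings so that leading zeros such as "034" are preserved.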

Add a Column Name that is the File name as a year to a dataframe

I am new to R. I have multiple files in a directory on my local pc. I have imported them to R and added column names as below. Now I need to add the year to each data frame which corresponds to the file name. For example the first file is called 1950 the 2nd 1951 and so on. How do I add the year as a column name with these values in R?
The output is below
Name Sex Number
1 Linda F 10
2 Mary F 100
3 Patrick M 200
4 Barbara F 300
5 Susan F 500
6 Richard M 900
7 Deborah F 500
8 Sandra F 23
9 Conor M 15
10 Conor F 120
I need another column at the start that is the year for this file?
This is my code to generate the above.
ldf <- list() # creates a list
listtxt <- dir(pattern = "*.txt") # creates the list of all the txt files in the directory
#Year = 1950
for (k in 1:length(listtxt)) #1:4 4 is the length of the list
{
  ldf[[k]] <- read.table(listtxt[k],header=F,sep=",")
  colnames(ldf[[k]]) = c('Name', 'Sex', 'Number')
  #test = cbind(ldf[[k]], Year )
}
I need the year to increase by 1 for each file and to add it as a column with the value?
Any help would be greatly appreciated.

You can add a column with the year by getting the year directly from the file name. I've also used lapply instead of a loop to cycle through each of the files.
In the code below, the function reads a single file and also adds a column with the year of that file. Since your file names have the year in the name, you just get the year from the file name using substr. lapply applies the function to every file name in listtxt, resulting in a list where each element is a data frame. Then you just rbind all of the list elements into a single data frame.
ldf = lapply(listtxt, function(x) {
  dat = read.table(x, header=FALSE, sep=",")
  # Add column names
  names(dat) = c('Name', 'Sex', 'Number')
  # Add a column with the year
  dat$Year = substr(x,1,4)
  return(dat)
})
# Combine all the individual data frames into a single data frame
df = do.call("rbind", ldf)
Instead of do.call("rbind", ldf) you can also use rbind_all from the dplyr package, as follows:
library(dplyr)
df = rbind_all(ldf)

I couldn't add this as a comment on @eipi10's answer above, so I'll have to do it here. I just tried this and it worked perfectly (thanks, I'd searched for hours with no luck), but I got a message that rbind_all is deprecated. The dplyr solution is now:
library(dplyr)
df = bind_rows(ldf)
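For a self-contained illustration, the sketch below writes two small files to a temporary directory and reads them back with the year as the first column, as the question asked. The file contents are made up, and read_year is a hypothetical helper name. It also sets colClasses explicitly, because a file whose Sex column contains only "F" would otherwise be read as logical FALSE.

```r
# Create two small example files named after their year (made-up contents)
tmp <- tempfile("names_")
dir.create(tmp)
writeLines(c("Linda,F,10", "Patrick,M,200"), file.path(tmp, "1950.txt"))
writeLines("Mary,F,100", file.path(tmp, "1951.txt"))

files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: read one file and prepend the year taken from its name
read_year <- function(path) {
  dat <- read.table(path, header = FALSE, sep = ",",
                    col.names = c("Name", "Sex", "Number"),
                    colClasses = c("character", "character", "integer"))
  year <- as.integer(tools::file_path_sans_ext(basename(path)))
  cbind(Year = year, dat)
}

all_years <- do.call(rbind, lapply(files, read_year))
all_years
```

Taking the year from basename() rather than substr(x, 1, 4) keeps it working when the file names include a directory path.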

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.

You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame

To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.

(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.

The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.

There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10

With this you can remove the column and store the result in another variable.
df = subset(data, select = -c(genome) )

Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)

I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
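A sketch of what such a pattern-based removal could look like, using grepl rather than grep. The data frame and the "^score_" pattern are made up for illustration. A logical mask is used because negating the result of which() or grep() drops every column when nothing matches.

```r
# Made-up data: two columns share the "score_" prefix
df <- data.frame(id = 1:3, score_a = 4:6, score_b = 7:9,
                 label = c("x", "y", "z"))

# TRUE for every column whose name matches the pattern
to_drop <- grepl("^score_", names(df))

# Keep the rest; drop = FALSE keeps a data.frame even if one column remains
kept <- df[, !to_drop, drop = FALSE]
names(kept)

# A pattern that matches nothing leaves the data frame unchanged
unchanged <- df[, !grepl("^zzz", names(df)), drop = FALSE]
ncol(unchanged)
```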
