Perl: reading a file into an array with varying row lengths gives "Use of uninitialized value"

First post, so please forgive any poor formatting. This is mostly an annoying output problem that makes it hard to scan for real errors. In short, I have to break every row of a large file into individual characters, but the rows are not all the same length, so towards the end I get a blast of "Use of uninitialized value" warnings. The script works fine, but the warnings bury the output I actually need: what the script is doing and which lines it chooses not to use. Details and the relevant part of the script are below.

use warnings;
use strict;

Dropping these would probably make the warnings go away, but I'd like to keep them.

I have a script that manipulates a PDB file, a format in which every value must sit in exact columns. The script is in Perl and takes in 10-50K files. It first breaks every line into individual characters (stored in the array @line) and joins fixed ranges of those characters into an array called @column. It then checks the first string of 6 characters and throws out any line that doesn't match one of three specific strings. It also changes column 22 to the letter "A". Finally, it writes everything to another file. The lines range from 3 to 80 characters, so whenever a line has fewer than 80 characters, every array element past the end of the line is undefined and triggers this warning. I saw a post with a similar problem, but it dealt with a CSV file, which doesn't apply here. I can't just split on spaces either, because, as you can see in the example lines below, fields bleed into each other, so the parsing has to be column specific.

read-in section:

while (my $row = <FH>) {
chomp $row;
$row =~ s/^\s+//;
@line = split(//, $row);
$column[0] = join ('', @line[0..5]);
$column[1] = join ('', @line[6..10]);
$column[2] = join ('', @line[11..15]);
$column[3] = join ('', $line[16]);
$column[4] = join ('', @line[17..19]);
$column[5] = join ('', $line[20]);
$column[6] = join ('', $line[21]);   # Chain ID
$column[7] = join ('', @line[22..25]);  #residue number
$column[8] = join ('', $line[26]);
$column[9] = join ('', @line[27..37]);
$column[10] = join ('', @line[38..45]);
$column[11] = join ('', @line[46..53]);
$column[12] = join ('', @line[54..59]);
$column[13] = join ('', @line[60..65]);
$column[14] = join ('', @line[66..75]);
$column[15] = join ('', @line[76..77]);

For short rows, this warning is repeated for 60+ lines:

no match
Use of uninitialized value in join or string at ./change_chain_ID_to_A.pl line 35, <FH> line 33828.
Use of uninitialized value in join or string at ./change_chain_ID_to_A.pl line 35, <FH> line 33828.
Use of uninitialized value in join or string at ./change_chain_ID_to_A.pl line 35, <FH> line 33828.
Use of uninitialized value in join or string at ./change_chain_ID_to_A.pl line 36, <FH> line 33828.
Use of uninitialized value in join or string at ./change_chain_ID_to_A.pl line 36, <FH> line 33828.
Use of uninitialized value in join or string at ./change_chain_ID_to_A.pl line 36, <FH> line 33828.

etc...etc

Example lines from the input file:

HETATM33701 CA   CA  I2111      20.810  32.443 -53.618  1.00  0.00          Ca
HETATM33702 CA   CA  I2112      -7.146  39.054 -51.559  1.00  0.00          Ca
CONECT 3502 3501 4093
CONECT 4093 3502 4092
CONECT119241192312515
CONECT125151192412514
CONECT203462034520937


Solution 1:[1]

The following code is provided for educational purposes only, with an emphasis on parsing fixed-length structured data records.

The question does not give enough information to point the OP in the right direction with certainty.

The better approach is to use a CPAN module designed specifically for this file format.

The demo code uses the unpack function to extract each record into a data structure.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;
use YAML;

my($data,$model,$index);

while( <DATA> ) {
    chomp;
    next if /^ENDMDL|^\s+\z/;
    my $item;
    $index = 0                  if /^MODEL/;
    $model = parse_model($_)    if /^MODEL/;
    $item  = parse_atom($_)     if /^ATOM/;
    $item  = parse_hetatm($_)   if /^HETATM/;
    $item  = parse_ter($_)      if /^TER/;
    $data->{$model->{serial}}[$index++] = $item if defined $item;
}

#say Dumper($data);
say Dump($data);

sub parse_model {
    my $line = shift;
    my $model;
    
    my @fields = qw/record_name serial/;
    
    $model->@{@fields} = unpack('a6x4a4', $line);
    defined($model->{$_}) && $model->{$_} =~ s/^\s+|\s+\z//g for @fields;
    
    return $model;
}

sub parse_atom {
    my $line = shift;
    my $atom;

    my @fields = qw/record_name serial name altLoc resName chainID resSeq iCode x y z occupancy tempFactor element charge/;
    
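    # One template item per field; the line is padded to 80 columns so short lines cannot run past the end.
    $atom->@{@fields} = unpack('a6 a5 x a4 a a3 x a a4 a x3 a8 a8 a8 a6 a6 x10 a2 a2', sprintf('%-80s', $line));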
    defined($atom->{$_}) && $atom->{$_} =~ s/^\s+|\s+\z//g for @fields;
    
    return $atom;
}

sub parse_hetatm {
    my $line = shift;
    my $hetatm;

    my @fields = qw/record_name serial name altLoc resName chainID resSeq iCode x y z occupancy tempFactor element charge/;

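    # Same column layout as ATOM; the line is padded to 80 columns so short lines cannot run past the end.
    $hetatm->@{@fields} = unpack('a6 a5 x a4 a a3 x a a4 a x3 a8 a8 a8 a6 a6 x10 a2 a2', sprintf('%-80s', $line));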
    defined($hetatm->{$_}) && $hetatm->{$_} =~ s/^\s+|\s+\z//g for @fields;
    
    return $hetatm;
}

sub parse_ter {
    my $line = shift;
    my $ter;
    
    my @fields = qw/record_name serial resName chainID resSeq iCode/;
    
    # Column 21 is blank, so skip it before reading chainID from column 22.
    $ter->@{@fields} = unpack('a6 a5 x6 a3 x a a4 a', sprintf('%-80s', $line));
    defined($ter->{$_}) && $ter->{$_} =~ s/^\s+|\s+\z//g for @fields;
    
    return $ter;
}

__DATA__
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
MODEL        1
ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N
ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C
HETATM 3835 FE   HEM A   1      17.140   3.115  15.066  1.00 14.14          FE
HETATM 8238  S   SO4 A2001      10.885 -15.746 -14.404  1.00 47.84           S  
HETATM 8239  O1  SO4 A2001      11.191 -14.833 -15.531  1.00 50.12           O  
HETATM 8240  O2  SO4 A2001       9.576 -16.338 -14.706  1.00 48.55           O  
HETATM 8241  O3  SO4 A2001      11.995 -16.703 -14.431  1.00 49.88           O  
HETATM 8242  O4  SO4 A2001      10.932 -15.073 -13.100  1.00 49.91           O
ATOM    293 1HG  GLU A  18     -14.861  -4.847   0.361  1.00  0.00           H
ATOM    294 2HG  GLU A  18     -13.518  -3.769   0.084  1.00  0.00           H
TER     295      GLU A  18
ENDMDL                                                              
MODEL        2                                                       
ATOM    296  N   ALA A   1      10.883   6.779  -6.464  1.00  0.00           N
ATOM    297  CA  ALA A   1      11.451   6.531  -5.142  1.00  0.00           C
HETATM 3835 FE   HEM A   1      17.140   3.115  15.066  1.00 14.14          FE
HETATM 8238  S   SO4 A2001      10.885 -15.746 -14.404  1.00 47.84           S  
HETATM 8239  O1  SO4 A2001      11.191 -14.833 -15.531  1.00 50.12           O  
HETATM 8240  O2  SO4 A2001       9.576 -16.338 -14.706  1.00 48.55           O  
HETATM 8241  O3  SO4 A2001      11.995 -16.703 -14.431  1.00 49.88           O  
HETATM 8242  O4  SO4 A2001      10.932 -15.073 -13.100  1.00 49.91           O
ATOM    588 1HG  GLU A  18     -13.363  -4.163  -2.372  1.00  0.00           H
ATOM    589 2HG  GLU A  18     -12.634  -3.023  -3.475  1.00  0.00           H
TER     590      GLU A  18
ENDMDL                                                              

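For the task described in the question (keep only certain record types, set the chain ID in column 22 to "A", and avoid the warnings on short rows), a minimal sketch might look like the following. It pads each row to 80 columns before touching it, so no column is ever undefined, and edits column 22 in place with substr instead of splitting the row into characters. The sub name set_chain_a and the list of kept record types are assumptions made for illustration; the question does not say which three strings it matches.

```perl
use strict;
use warnings;

# Record types to keep. ASSUMPTION: the question mentions "three specific
# strings" without naming them; these are placeholders.
my %keep = map { $_ => 1 } qw(ATOM HETATM TER);

# Hypothetical helper: returns the row with column 22 (chain ID) set to 'A',
# or undef when the record type is not in %keep.
sub set_chain_a {
    my ($row) = @_;
    $row =~ s/\s+\z//;                    # drop the newline and trailing blanks
    my $padded = sprintf '%-80s', $row;   # pad short rows to the full 80 columns
    my $record = substr $padded, 0, 6;    # columns 1-6: record name
    $record =~ s/\s+\z//;
    return unless $keep{$record};         # caller can report "no match"
    substr $padded, 21, 1, 'A';           # column 22 (offset 21): chain ID
    $padded =~ s/\s+\z//;                 # remove the padding again
    return $padded;
}

# Sample rows: the first two are taken from the question, the third is a
# made-up 3-character row to exercise the short-line case.
for my $row (
    'HETATM33701 CA   CA  I2111      20.810  32.443 -53.618  1.00  0.00          Ca',
    'CONECT 3502 3501 4093',
    'TER',
) {
    my $fixed = set_chain_a($row);
    print defined $fixed ? "$fixed\n" : "no match\n";
}
```

In a real script the for loop would be replaced by the while loop over the filehandle from the question, printing each defined result to the output file. Because the row is padded first, the same idea also keeps unpack templates that use x from running past the end of a short line.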
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
